Abstract

We analyze the convergence of gradient-based optimization algorithms that base their updates
on delayed stochastic gradient information. The main application of our results is to the
development of gradient-based distributed optimization algorithms where a master node performs
parameter updates while worker nodes compute stochastic gradients based on local information in
parallel, which may give rise to delays due to asynchrony. We take motivation from statistical
problems where the size of the data is so large that it cannot fit on one computer; with the
advent of huge datasets in biology, astronomy, and the internet, such problems are now common.
Our main contribution is to show that for smooth stochastic problems, the delays are
asymptotically negligible and we can achieve order-optimal convergence results. In application
to distributed optimization, we develop procedures that overcome communication bottlenecks
and synchronization requirements. We show n-node architectures whose optimization error in
stochastic problems, in spite of asynchronous delays, scales asymptotically as $O(1/\sqrt{nT})$
after $T$ iterations. This rate is known to be optimal for a distributed system with $n$ nodes even
in the absence of delays. We additionally complement our theoretical results with numerical
experiments on a statistical machine learning task.

1 Introduction

We focus on stochastic convex optimization problems of the form

$$\operatorname*{minimize}_{x \in \mathcal{X}} \; f(x) \quad \text{for} \quad f(x) := \mathbb{E}_P[F(x; \xi)] = \int_{\Xi} F(x; \xi)\, dP(\xi), \tag{1}$$

where $\mathcal{X} \subseteq \mathbb{R}^d$ is a closed convex set, $P$ is a probability distribution over $\Xi$, and
$F(\cdot\,; \xi)$ is convex for all $\xi \in \Xi$, so that $f$ is convex. The goal is to find a parameter $x$ that
approximately minimizes $f$ over $x \in \mathcal{X}$. Classical stochastic gradient algorithms [RM51, Pol87]
iteratively update a parameter $x(t) \in \mathcal{X}$ by sampling $\xi \sim P$, computing $g(t) = \nabla F(x(t); \xi)$,
and performing the update $x(t+1) = \Pi_{\mathcal{X}}(x(t) - \alpha(t) g(t))$, where $\Pi_{\mathcal{X}}$ denotes projection
onto the set $\mathcal{X}$. In this paper, we analyze asynchronous gradient methods, where instead of
receiving current information $g(t)$, the procedure receives out of date gradients
$g(t - \tau(t)) = \nabla F(x(t - \tau(t)), \xi)$, where $\tau(t)$ is the (potentially random) delay at time $t$.
The central contribution of this paper is to develop algorithms that, under natural assumptions
about the functions $F$ in the objective (1), achieve asymptotically optimal rates for
stochastic convex optimization in spite of delays.

Our model of delayed gradient information is particularly relevant in distributed optimization
scenarios, where a master maintains the parameters $x$ while workers compute stochastic gradients of
the objective (1). The architectural assumption of a master with several worker nodes is natural for
distributed computation, and other researchers have considered models similar to those in this
paper [NBB01, LSZ09]. By allowing delayed and asynchronous updates, we can avoid synchronization
issues that commonly handicap distributed systems.

Figure 1. Cyclic delayed update architecture. Workers compute gradients cyclically and in parallel,
passing out-of-date information to master. Master responds with current parameters. Diagram shows
parameters and gradients communicated between rounds $t$ and $t + n - 1$.



Distributed optimization has been studied for several decades, tracing back at least
to the seminal work of Tsitsiklis and colleagues [Tsi84, BT89] on minimization of smooth functions
where the parameter vector is distributed. More recent work has studied problems in which each
processor or node $i$ in a network has a local function $f_i$, and the goal is to minimize the sum
$f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x)$ [NO09, RNV10, JRJ09, DAW10]. Most prior work assumes as a constraint
that data lies on several different nodes throughout a network. However, as Dekel et al. [DGBSX10a]
first noted, in distributed stochastic settings independent realizations of a stochastic gradient can
be computed concurrently, and it is thus possible to obtain an aggregated gradient estimate with
lower variance. Using modern stochastic optimization algorithms (e.g. [JNT08, Lan10]), Dekel et
al. give a series of reductions to show that in an $n$-node network it is possible to achieve a speedup
of $O(n)$ over a single processor so long as the objective $f$ is smooth.

Our work is closest to Nedić et al.'s asynchronous subgradient method [NBB01], which is an
incremental gradient procedure in which gradient projection steps are taken using out-of-date
gradients. See Figure 1 for an illustration. The asynchronous subgradient method performs non-smooth
minimization and suffers an asymptotic penalty in convergence rate due to the delays: if the
gradients are computed with a delay of $\tau$, then the convergence rate of the procedure is
$O(\sqrt{\tau/T})$. The setting of distributed optimization provides an elegant illustration of the role
played by the delay in convergence rates. As in Fig. 1, the delay $\tau$ can essentially be of order $n$
in Nedić et al.'s setting, which gives a convergence rate of $O(\sqrt{n/T})$. A simple centralized
stochastic gradient algorithm attains a rate of $O(1/\sqrt{T})$, which suggests something is amiss in
the distributed algorithm. Langford et al. [LSZ09] rediscovered Nedić et al.'s results and attempted
to remove the asymptotic penalty by considering smooth objective functions, though their approach has
a technical error (see Appendix C), and even so they do not demonstrate any provable benefits of
distributed computation. We analyze similar asynchronous algorithms, but we show that for smooth
stochastic problems the delay is asymptotically negligible (the time $\tau$ does not matter) and in
fact, with parallelization, delayed updates can give provable performance benefits.

We build on results of Dekel et al. [DGBSX10a], who show that when the objective $f$ has
Lipschitz-continuous gradients and $n$ processors compute stochastic gradients in parallel
using a common parameter $x$, it is possible to achieve convergence rate $O(1/\sqrt{Tn})$ so long as the
processors are synchronized (under appropriate synchrony conditions, this holds nearly
independently of network topology). A variant of their approach is asymptotically robust to asynchrony
so long as most processors remain synchronized for most of the time [DGBSX10b]. We show
results similar to their initial discovery, but we analyze the effects of asynchronous gradient updates
where all the nodes in the network can suffer delays. Application of our main results to the
distributed setting provides convergence rates in terms of the number of nodes $n$ in the network and
the stochastic process governing the delays. Concretely, we show that under different assumptions
on the network and delay process, we achieve convergence rates ranging from $O(n^3/T + 1/\sqrt{Tn})$ to
$O(n/T + 1/\sqrt{Tn})$, which is $O(1/\sqrt{nT})$ asymptotically in $T$. For problems with large $n$, we
demonstrate faster rates ranging from $O((n/T)^{2/3} + 1/\sqrt{Tn})$ to $O(1/T^{2/3} + 1/\sqrt{Tn})$. In either
case, the time necessary to achieve an $\epsilon$-optimal solution to the problem (1) is asymptotically
$O(1/(n\epsilon^2))$, a factor of $n$ (the size of the network) better than a centralized procedure in
spite of delay.

The remainder of the paper is organized as follows. We begin by reviewing known algorithms
for solving the stochastic optimization problem (1) and stating our main assumptions. Then in
Section 3 we give abstract descriptions of our algorithms and state our main theoretical results,
which we make concrete in Section 4 by formally placing the analysis in the setting of distributed
stochastic optimization. We complement the theory in Section 5 with experiments on a real-world
dataset, and proofs follow in the remaining sections.

Notation. For the reader's convenience, we collect our (mostly standard) notation here. We
denote general norms by $\|\cdot\|$, and the dual norm to the norm $\|\cdot\|$ is defined as
$\|z\|_* := \sup_{x : \|x\| \le 1} \langle z, x \rangle$. The subdifferential set of a function $f$ is

$$\partial f(x) := \{ g \in \mathbb{R}^d \mid f(y) \ge f(x) + \langle g, y - x \rangle \text{ for all } y \in \operatorname{dom} f \}.$$

We use the shorthand $\|\partial f(x)\|_* := \sup_{g \in \partial f(x)} \|g\|_*$. A function $f$ is $G$-Lipschitz with
respect to the norm $\|\cdot\|$ on $\mathcal{X}$ if for all $x, y \in \mathcal{X}$, $|f(x) - f(y)| \le G \|x - y\|$. For convex $f$,
this is equivalent to $\|\partial f(x)\|_* \le G$ for all $x \in \mathcal{X}$ (e.g. [HUL96a]). A function $f$ is $L$-smooth
on $\mathcal{X}$ if $\nabla f$ is Lipschitz continuous with respect to the norm $\|\cdot\|$, defined as

$$\|\nabla f(x) - \nabla f(y)\|_* \le L \|x - y\|, \quad \text{equivalently,} \quad f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \|x - y\|^2.$$

For convex differentiable $h$, the Bregman divergence [Bre67] between $x$ and $y$ is defined as

$$D_h(x, y) := h(x) - h(y) - \langle \nabla h(y), x - y \rangle. \tag{2}$$

A convex function $h$ is $c$-strongly convex with respect to a norm $\|\cdot\|$ over $\mathcal{X}$ if

$$h(y) \ge h(x) + \langle g, y - x \rangle + \frac{c}{2} \|x - y\|^2 \quad \text{for all } x, y \in \mathcal{X} \text{ and } g \in \partial h(x). \tag{3}$$

We use $[n]$ to denote the set of integers $\{1, \ldots, n\}$.
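As a small numerical illustration of definition (2), the following Python sketch evaluates the
Bregman divergence for two choices of $h$. The helper names (`bregman`, `half_sq`, `neg_entropy`)
and the test points are ours and purely illustrative. For $h(x) = \frac{1}{2}\|x\|_2^2$ the divergence
reduces to $\frac{1}{2}\|x - y\|_2^2$, and for the negative entropy on the probability simplex it reduces
to the KL divergence.

```python
import math

def bregman(h, grad_h, x, y):
    """D_h(x, y) = h(x) - h(y) - <grad h(y), x - y>, following definition (2)."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_h(y), x, y))
    return h(x) - h(y) - inner

# h(x) = 0.5 * ||x||_2^2 gives D_h(x, y) = 0.5 * ||x - y||_2^2.
def half_sq(x):
    return 0.5 * sum(xi * xi for xi in x)

def half_sq_grad(x):
    return list(x)

# Negative entropy gives the KL divergence when x and y lie on the simplex.
def neg_entropy(x):
    return sum(xi * math.log(xi) for xi in x)

def neg_entropy_grad(x):
    return [math.log(xi) + 1.0 for xi in x]

# Illustrative points on the probability simplex (both sum to one).
x = [0.2, 0.3, 0.5]
y = [0.4, 0.4, 0.2]
d_euclid = bregman(half_sq, half_sq_grad, x, y)
d_kl = bregman(neg_entropy, neg_entropy_grad, x, y)
```

Both choices of $h$ are 1-strongly convex in the sense of (3), the first with respect to the
$\ell_2$-norm and the second with respect to the $\ell_1$-norm on the simplex, so in each case the
divergence is at least half the squared distance in the corresponding norm.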

2 Setup and Algorithms 

In this section we set up and recall the delay-free algorithms underlying our approach. We then
give the appropriate delayed versions of these algorithms, which we analyze in the sequel.

2.1 Setup and Delay-free Algorithms

To build intuition for the algorithms we analyze, we first describe two closely related first-order
algorithms: the dual averaging algorithm of Nesterov [Nes09] and the mirror descent algorithm
of Nemirovski and Yudin [NY83], which is analyzed further by Beck and Teboulle [BT03]. We
begin by collecting notation and giving useful definitions. Both algorithms are based on a proximal
function $\psi(x)$, where it is no loss of generality to assume that $\psi(x) \ge 0$ for all $x \in \mathcal{X}$. We assume
$\psi$ is 1-strongly convex (by scaling, this is no loss of generality). By definitions (2) and (3), the
divergence satisfies $D_\psi(x, y) \ge \frac{1}{2} \|x - y\|^2$.

In the oracle model of stochastic optimization that we assume, at time $t$ both algorithms query
an oracle at the point $x(t)$, and the oracle then samples $\xi(t)$ i.i.d. from the distribution $P$ and
returns $g(t) \in \partial F(x(t); \xi(t))$. The dual averaging algorithm [Nes09] updates a dual vector $z(t)$ and
primal vector $x(t) \in \mathcal{X}$ via

$$z(t+1) = z(t) + g(t) \quad \text{and} \quad x(t+1) = \operatorname*{argmin}_{x \in \mathcal{X}} \Big\{ \langle z(t+1), x \rangle + \frac{1}{\alpha(t+1)} \psi(x) \Big\}, \tag{4}$$

while mirror descent [NY83, BT03] performs the update

$$x(t+1) = \operatorname*{argmin}_{x \in \mathcal{X}} \Big\{ \langle g(t), x \rangle + \frac{1}{\alpha(t)} D_\psi(x, x(t)) \Big\}. \tag{5}$$

Both make a linear approximation to the function being minimized (a global approximation in the
case of the dual averaging update (4) and a more local approximation for mirror descent (5)) while
using the proximal function $\psi$ to regularize the points $x(t)$.
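To make the two updates concrete, the following Python sketch specializes (4) and (5) to the
Euclidean proximal function $\psi(x) = \frac{1}{2}\|x\|_2^2$ with $\mathcal{X}$ an $\ell_2$-ball. This specialization,
the ball constraint, and the function names are assumptions made for illustration only. In this
case the dual averaging step is a projection of $-\alpha(t+1) z(t+1)$, and the mirror descent step is
projected gradient descent.

```python
import math

def project_l2_ball(v, radius):
    """Euclidean projection onto {x : ||x||_2 <= radius}."""
    norm = math.sqrt(sum(vi * vi for vi in v))
    if norm <= radius:
        return list(v)
    return [radius * vi / norm for vi in v]

def dual_averaging_step(z, g, alpha_next, radius):
    """Update (4) with psi(x) = 0.5 * ||x||_2^2: returns (z(t+1), x(t+1))."""
    z_next = [zi + gi for zi, gi in zip(z, g)]
    # argmin_x <z, x> + ||x||^2 / (2 * alpha) over the ball is Proj(-alpha * z).
    x_next = project_l2_ball([-alpha_next * zi for zi in z_next], radius)
    return z_next, x_next

def mirror_descent_step(x, g, alpha, radius):
    """Update (5) with psi(x) = 0.5 * ||x||_2^2: a projected gradient step."""
    return project_l2_ball([xi - alpha * gi for xi, gi in zip(x, g)], radius)

# Starting from z = 0 and x = 0, the first steps of the two methods coincide.
g = [3.0, -4.0]
z1, x_da = dual_averaging_step([0.0, 0.0], g, alpha_next=0.1, radius=1.0)
x_md = mirror_descent_step([0.0, 0.0], g, alpha=0.1, radius=1.0)
```

After the first step the two methods generally differ: dual averaging re-solves against the
accumulated vector $z(t)$, whereas mirror descent moves from the current point $x(t)$.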

We now state the two essentially standard assumptions [JNT08, Lan10, Xia10] we most often
make about the stochastic optimization problem (1), after which we recall the convergence rates of
the algorithms (4) and (5).

Assumption A (Lipschitz Functions). For $P$-a.e. $\xi$, the function $F(\cdot\,; \xi)$ is convex. Moreover, for
any $x \in \mathcal{X}$, $\mathbb{E}[\|\partial F(x; \xi)\|_*^2] \le G^2$.

In particular, Assumption A implies that $f$ is $G$-Lipschitz continuous with respect to the norm $\|\cdot\|$
and that $f$ is convex. Our second assumption has been used to show rates of convergence based on
the variance of a gradient estimator for stochastic optimization problems (e.g. [JNT08, Lan10]).

Assumption B (Smooth Functions). The function $f$ defined in (1) has $L$-Lipschitz continuous
gradient, and for all $x \in \mathcal{X}$ the variance bound $\mathbb{E}[\|\nabla f(x) - \nabla F(x; \xi)\|_*^2] \le \sigma^2$ holds.¹
Several commonly used functions satisfy the above assumptions, for example:

(i) The logistic loss: $F(x; \xi) = \log[1 + \exp(\langle x, \xi \rangle)]$, the objective for logistic regression in
statistics (e.g. [HTF01]). The objective $F$ satisfies Assumptions A and B so long as $\|\xi\|$ is bounded.

(ii) Least squares or linear regression: $F(x; \xi) = (b - \langle a, x \rangle)^2$ where $\xi = (a, b)$ for $a \in \mathbb{R}^d$ and
$b \in \mathbb{R}$, satisfies Assumptions A and B as long as $\xi$ is bounded and $\mathcal{X}$ is compact.

We also make a standard compactness assumption on the optimization set $\mathcal{X}$.

Assumption C (Compactness). For $x^* \in \operatorname*{argmin}_{x \in \mathcal{X}} f(x)$ and $x \in \mathcal{X}$, the bounds $\psi(x^*) \le R^2/2$
and $D_\psi(x^*, x) \le R^2$ both hold.

¹ If $f$ is differentiable, then $F(\cdot\,; \xi)$ is differentiable for $P$-a.e. $\xi$, and conversely, but $F$ need not be
smoothly differentiable [Ber73]. Since $\nabla F(x; \xi)$ exists for $P$-a.e. $\xi$, we will write $\nabla F(x; \xi)$ with no
loss of generality.

Under Assumptions A or B in addition to Assumption C, the updates (4) and (5) have known
convergence rates. Define the time averaged vector $\hat{x}(T)$ as

$$\hat{x}(T) := \frac{1}{T} \sum_{t=1}^T x(t+1). \tag{6}$$

Then under Assumption A both algorithms satisfy

$$\mathbb{E}[f(\hat{x}(T))] - f(x^*) = O\Big( \frac{RG}{\sqrt{T}} \Big) \tag{7}$$

for the stepsize choice $\alpha(t) = R/(G\sqrt{t})$ (e.g. [Nes09, Xia10, NJLS09]). The result (7) is sharp to
constant factors in general [NY83, ABRW10], but can be further improved under Assumption B.
Building on work of Juditsky et al. [JNT08] and Lan [Lan10], Dekel et al. [DGBSX10a, Appendix
A] show that under Assumptions B and C the stepsize choice $\alpha(t)^{-1} = L + \eta(t)$, where $\eta(t)$ is a
damping factor set to $\eta(t) = \sigma\sqrt{t}/R$, yields for either of the updates (4) or (5) the convergence rate

$$\mathbb{E}[f(\hat{x}(T))] - f(x^*) = O\Big( \frac{LR^2}{T} + \frac{\sigma R}{\sqrt{T}} \Big). \tag{8}$$

2.2 Delayed Optimization Algorithms 

We now turn to extending the dual averaging (4) and mirror descent (5) updates to the setting in
which instead of receiving a current gradient $g(t)$ at time $t$, the procedure receives a gradient
$g(t - \tau(t))$, that is, a stochastic gradient of the objective (1) computed at the point
$x(t - \tau(t))$. In the simplest case, the delays are uniform and $\tau(t) \equiv \tau$ for all $t$, but in general
the delays may be a non-i.i.d. stochastic process. Our analysis admits any sequence $\tau(t)$ of delays
as long as the mapping $t \mapsto \tau(t)$ satisfies $\mathbb{E}[\tau(t)] \le B < \infty$. We also require that each update
happens once, i.e., $t \mapsto t - \tau(t)$ is one-to-one, though this second assumption is easily satisfied.

Recall that the problems we consider are stochastic optimization problems of the form (1).
Under the assumptions above, we extend the mirror descent and dual averaging algorithms in the
simplest way: we replace $g(t)$ with $g(t - \tau(t))$. For dual averaging (c.f. the update (4)) this yields

$$z(t+1) = z(t) + g(t - \tau(t)) \quad \text{and} \quad x(t+1) = \operatorname*{argmin}_{x \in \mathcal{X}} \Big\{ \langle z(t+1), x \rangle + \frac{1}{\alpha(t+1)} \psi(x) \Big\}, \tag{9}$$

while for mirror descent (c.f. the update (5)) we have

$$x(t+1) = \operatorname*{argmin}_{x \in \mathcal{X}} \Big\{ \langle g(t - \tau(t)), x \rangle + \frac{1}{\alpha(t)} D_\psi(x, x(t)) \Big\}. \tag{10}$$

A generalization of Nedić et al.'s results [NBB01], obtained by combining their techniques with the
convergence proofs of dual averaging [Nes09] and mirror descent [BT03], is as follows. Under
Assumptions A and C, so long as $\mathbb{E}[\tau(t)] \le B < \infty$ for all $t$, choosing $\alpha(t) = \frac{R}{G\sqrt{Bt}}$ gives rate

$$\mathbb{E}[f(\hat{x}(T))] - f(x^*) = O\Big( \frac{RG\sqrt{B}}{\sqrt{T}} \Big). \tag{11}$$

3 Convergence rates for delayed optimization of smooth functions

In this section, we state and discuss several results for asynchronous stochastic gradient methods.
We give two sets of theorems. The first are for the asynchronous method when we make updates to
the parameter vector $x$ using one stochastic subgradient, according to the update rules (9) or (10).
The second method involves using several stochastic subgradients for every update, each with a
potentially different delay, which gives sharper results that we present in Section 3.2.



3.1 Simple delayed optimization 



Intuitively, the $\sqrt{B}$-penalty due to delays for non-smooth optimization arises from the fact that
subgradients can change drastically when measured at slightly different locations, so a small delay
can introduce significant inaccuracy. To overcome the delay penalty, we now turn to the smoothness
assumption B as well as the Lipschitz condition A (we assume both of these conditions along with
Assumption C hold for all the theorems). In the smooth case, delays mean that stale gradients are
only slightly perturbed, since our stochastic algorithms constrain the variability of the points $x(t)$.
As we show in the proofs of the remaining results, the error from delay essentially becomes a second
order term: the penalty is asymptotically negligible. We study both update rules (9) and (10), and
we set $\alpha(t) = \frac{1}{L + \eta(t)}$. Here $\eta(t)$ will be chosen to control both the effects of delays and the
errors from stochastic gradient information. We prove the following theorem in Sec. 6.1.



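Theorem 1. Let the sequence $x(t)$ be defined by the update (9). Define the stepsize $\eta(t) \propto \sqrt{t + \tau}$
or let $\eta(t) \equiv \eta$ for all $t$. Then

[Display lost in extraction: an upper bound on $\mathbb{E}\big[\sum_{t=1}^T f(x(t+1))\big] - T f(x^*)$.]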



The mirror descent update (10) exhibits similar convergence properties, and we prove the next
theorem in Sec. 6.2.

Theorem 2. Use the conditions of Theorem 1, but generate $x(t)$ by the update (10). Then

$$\mathbb{E}\Big[ \sum_{t=1}^T f(x(t+1)) \Big] - T f(x^*) \le 2LR^2 + R^2 [\eta(1) + \eta(T)] + \frac{\sigma^2}{2} \sum_{t=1}^T \frac{1}{\eta(t)} + 2LG^2 (\tau + 1)^2 \sum_{t=1}^T \frac{1}{\eta(t - \tau)^2} + 2\tau GR.$$



In each of the above theorems, we can set $\eta(t) = \sigma\sqrt{t + \tau}/R$. As immediate corollaries, we
recall the definition (6) of the averaged sequence of $x(t)$ and use convexity to see that

$$\mathbb{E}[f(\hat{x}(T))] - f(x^*) = O\Big( \frac{LR^2 + \tau GR}{T} + \frac{\sigma R}{\sqrt{T}} + \frac{LG^2 \tau^2 R^2 \log T}{\sigma^2 T} \Big)$$

for either update rule. In addition, we can allow the delay $\tau(t)$ to be random:

Corollary 1. Let the conditions of Theorem 1 or 2 hold, but allow $\tau(t)$ to be a random mapping
such that $\mathbb{E}[\tau(t)^2] \le B^2$ for all $t$. With the choice $\eta(t) \equiv \sigma\sqrt{T}/R$ the updates (9) and (10) satisfy

$$\mathbb{E}[f(\hat{x}(T))] - f(x^*) = O\Big( \frac{LR^2 + B^2 GR}{T} + \frac{\sigma R}{\sqrt{T}} + \frac{LG^2 B^2 R^2}{\sigma^2 T} \Big).$$

We provide the proof of the corollary in Sec. 6.3. The take-home message from the above corollaries,
as well as Theorems 1 and 2, is that the penalty in convergence rate due to the delay $\tau(t)$ is
asymptotically negligible. As we discuss in greater depth in the next section, this has favorable
implications for robust distributed stochastic optimization algorithms.
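The following Python sketch simulates the delayed mirror descent update (10) in one dimension
with $\psi(x) = \frac{1}{2}x^2$ and the stepsize $\alpha(t) = 1/(L + \eta(t))$, $\eta(t) = \sigma\sqrt{t + \tau}/R$. The
toy objective $F(x; \xi) = \frac{1}{2}(x - \xi)^2$ with Gaussian $\xi$, the constraint set $[-1, 1]$, the value
$R = 1.1$, the random seed, and the function name `delayed_sgd` are all assumptions chosen for
illustration; this is not one of the experiments of the paper.

```python
import math
import random

def delayed_sgd(T, tau, mu=0.5, sigma=1.0, L=1.0, R=1.1, seed=0):
    """Delayed update (10) with psi(x) = 0.5 * x^2 on X = [-1, 1].

    Toy objective (an assumption for illustration): F(x; xi) = 0.5 * (x - xi)^2
    with xi ~ N(mu, sigma^2), so f is 1-smooth and minimized at x* = mu.
    Returns the averaged iterate defined in (6).
    """
    rng = random.Random(seed)
    xs = [0.0]                      # xs[t - 1] holds x(t); x(1) = 0
    running_sum = 0.0
    for t in range(1, T + 1):
        stale = xs[max(t - tau, 1) - 1]        # x(t - tau), clamped at x(1)
        g = stale - rng.gauss(mu, sigma)       # gradient of F at the stale point
        eta = sigma * math.sqrt(t + tau) / R   # damping factor eta(t)
        alpha = 1.0 / (L + eta)
        x_next = min(1.0, max(-1.0, xs[-1] - alpha * g))   # projected step
        xs.append(x_next)
        running_sum += x_next
    return running_sum / T

x_hat_no_delay = delayed_sgd(T=20000, tau=0)
x_hat_delayed = delayed_sgd(T=20000, tau=5)
```

In this toy problem both averaged iterates should land close to $x^* = 0.5$, which is the behavior
one would expect if the delay only contributes lower-order terms.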



3.2 Combinations of delays 

In some scenarios, including distributed settings similar to those we discuss in the next section,
the procedure has access not only to a single delayed gradient but to several with different delays.
To abstract away the essential parts of this situation, we assume that the procedure receives $n$
gradients $g_1, \ldots, g_n$, where each has a potentially different delay $\tau(i)$. Now let $\lambda = (\lambda_i)_{i=1}^n$
belong to the probability simplex, though we leave $\lambda$'s values unspecified for now. Then the
procedure performs the following updates at time $t$: for dual averaging,

$$z(t+1) = z(t) + \sum_{i=1}^n \lambda_i g_i(t - \tau(i)) \quad \text{and} \quad x(t+1) = \operatorname*{argmin}_{x \in \mathcal{X}} \Big\{ \langle z(t+1), x \rangle + \frac{1}{\alpha(t+1)} \psi(x) \Big\}, \tag{12}$$

while for mirror descent, the update is

$$g_\lambda(t) = \sum_{i=1}^n \lambda_i g_i(t - \tau(i)) \quad \text{and} \quad x(t+1) = \operatorname*{argmin}_{x \in \mathcal{X}} \Big\{ \langle g_\lambda(t), x \rangle + \frac{1}{\alpha(t)} D_\psi(x, x(t)) \Big\}. \tag{13}$$
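As a minimal sketch of the aggregation step in (13), the function below forms the convex
combination of delayed gradients from a per-worker history. The data layout
(`history[i][s]` holding $g_i(s)$ as a list of floats) and the function name are hypothetical
conventions introduced only for this example.

```python
def combine_delayed_gradients(history, t, delays, weights):
    """Form g_lambda(t) = sum_i weights[i] * g_i(t - delays[i]) as in (13).

    history[i][s] is assumed to hold worker i's gradient g_i(s) as a list of floats.
    """
    if abs(sum(weights) - 1.0) > 1e-9 or min(weights) < 0:
        raise ValueError("weights must lie in the probability simplex")
    dim = len(history[0][t - delays[0]])
    combined = [0.0] * dim
    for i, (tau_i, lam_i) in enumerate(zip(delays, weights)):
        g = history[i][t - tau_i]
        for k in range(dim):
            combined[k] += lam_i * g[k]
    return combined

# Two workers: worker 0 reports with no delay, worker 1 with delay 2.
history = [{3: [1.0, 0.0]}, {1: [0.0, 2.0]}]
g_lambda = combine_delayed_gradients(history, t=3, delays=[0, 2], weights=[0.5, 0.5])
```

The combined vector can then be passed to either the dual averaging update (12) or the mirror
descent update (13) in place of a single delayed gradient.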



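The next two theorems build on the proofs of Theorems 1 and 2, combining several techniques. We
provide the proof of Theorem 3 later in the paper, omitting the proof of Theorem 4 as it follows in a
similar way from Theorem 2.

Theorem 3. Let the sequence $x(t)$ be defined by the update (12). Under Assumptions A, B, and C,
let $\frac{1}{\alpha(t)} = L + \eta(t)$ and $\eta(t) \propto \sqrt{t + \tau}$ or $\eta(t) \equiv \eta$ for all $t$. Then

$$\mathbb{E}\Big[ \sum_{t=1}^T f(x(t+1)) - T f(x^*) \Big] \le LR^2 + \eta(T) R^2 + 2 \sum_{i=1}^n \lambda_i \tau(i) GR + 2 \sum_{i=1}^n \lambda_i LG^2 (\tau(i) + 1)^2 \sum_{t=1}^T \frac{1}{\eta(t - \tau)^2} + \sum_{t=1}^T \frac{1}{2\eta(t)} \mathbb{E}\Big\| \sum_{i=1}^n \lambda_i [\nabla f(x(t - \tau(i))) - g_i(t - \tau(i))] \Big\|_*^2.$$

Theorem 4. Use the same conditions as Theorem 3, but assume that $x(t)$ is defined by the
update (13) and $D_\psi(x^*, x) \le R^2$ for all $x \in \mathcal{X}$. Then

$$\mathbb{E}\Big[ \sum_{t=1}^T f(x(t+1)) - T f(x^*) \Big] \le 2R^2 (L + \eta(T)) + 2 \sum_{i=1}^n \lambda_i \tau(i) GR + 2 \sum_{i=1}^n \lambda_i LG^2 (\tau(i) + 1)^2 \sum_{t=1}^T \frac{1}{\eta(t - \tau)^2} + \sum_{t=1}^T \frac{1}{2\eta(t)} \mathbb{E}\Big\| \sum_{i=1}^n \lambda_i [\nabla f(x(t - \tau(i))) - g_i(t - \tau(i))] \Big\|_*^2.$$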



The consequences of Theorems 3 and 4 are powerful, as we illustrate in the next section.

4 Distributed Optimization

We now turn to what we see as the main purpose and application of the above results: developing
robust and efficient algorithms for distributed stochastic optimization. Our main motivations here
are machine learning and statistical applications where the data is so large that it cannot fit on a
single computer. Examples of the form (1) include logistic regression (for background, see [HTF01]),
where the task is to learn a linear classifier that assigns labels in $\{-1, +1\}$ to a series of examples,
in which case we have the objective $F(x; \xi) = \log[1 + \exp(\langle \xi, x \rangle)]$ as described in Sec. 2.1(i); or
linear regression, where $\xi = (a, b) \in \mathbb{R}^d \times \mathbb{R}$ and $F(x; \xi) = \frac{1}{2} [b - \langle a, x \rangle]^2$ as described in
Sec. 2.1(ii). Both objectives satisfy Assumptions A and B as discussed earlier. We consider both
stochastic and online/streaming scenarios for such problems. In the simplest setting, the
distribution $P$ in the objective (1) is the empirical distribution over an observed dataset, that is,

$$f(x) = \frac{1}{N} \sum_{i=1}^N F(x; \xi_i).$$

We divide the $N$ samples among $n$ workers so that each worker has an $N/n$-sized subset of data.
In streaming applications, the distribution $P$ is the unknown distribution generating the data, and
each worker receives a stream of independent data points $\xi \sim P$. Worker $i$ uses its subset of the
data, or its stream, to compute $g_i$, an estimate of the gradient $\nabla f$ of the global $f$. We make the
simplifying assumption that $g_i$ is an unbiased estimate of $\nabla f(x)$, which is satisfied, for example,
when each worker receives an independent stream of samples or computes the gradient $g_i$ based on
samples picked at random without replacement from its subset of the data.

The architectural assumptions we make are natural and based on master/worker topologies,
but the convergence results in Section 3 allow us to give procedures robust to delay and asynchrony.
We consider two protocols: in the first, workers compute and communicate asynchronously and
independently with the master, and in the second, workers are at different distances from the
master and communicate with time lags proportional to their distances. We show in the latter part
of this section that the convergence rates of each protocol, when applied in an $n$-node network, are
$O(1/\sqrt{nT})$ (though lower order terms are different for each).

Before describing our architectures, we note that perhaps the simplest master-worker scheme is
to have each worker simultaneously compute a stochastic gradient and send it to the master, which
takes a gradient step on the averaged gradient. While the $n$ gradients are computed in parallel,
accumulating and averaging $n$ gradients at the master takes $\Omega(n)$ time, offsetting the gains of
parallelization. Thus we consider alternate architectures that are robust to delay.

Cyclic Delayed Architecture. This protocol is the delayed update algorithm mentioned in
the introduction, and it parallelizes computation of (estimates of) the gradient $\nabla f(x)$. Formally,
worker $i$ has parameter $x(t)$ and computes $g_i(t) = \nabla F(x(t); \xi_i(t))$, where $\xi_i(t)$ is a random variable
sampled at worker $i$ from the distribution $P$. The master maintains a parameter vector $x \in \mathcal{X}$.
The algorithm proceeds in rounds, cyclically pipelining updates. The algorithm begins by initiating
gradient computations at different workers at slightly offset times. At time $t$, the master receives
gradient information at a $\tau$-step delay from some worker, performs a parameter update, and passes
the updated central parameter $x(t+1)$ back to the worker. Other workers do not see this update
and continue their gradient computations on stale parameter vectors. In the simplest case, each
node suffers a delay of $\tau = n$, though our earlier analysis applies to random delays throughout the
network as well. Recall Fig. 1 for a graphic description of the process.

Figure 2. Master-worker averaging network. (a): parameters stored at different distances from
master node at time $t$. A node at distance $d$ from master has the parameter $x(t - d)$. (b): gradients
computed at different nodes. A node at distance $d$ from master computes gradient $g(t - d)$.
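The round-robin structure of the cyclic protocol can be sketched as follows. The indexing
convention is an assumption taken from Figure 1: at round $t$ the master holds $x(t)$, receives a
gradient from worker $t \bmod n$, and hands that worker $x(t)$. The function name and the warm-up
handling are ours.

```python
def cyclic_schedule(n_workers, n_rounds):
    """Return (round, worker, version) triples for the cyclic protocol.

    Indexing assumption (following Figure 1): at round t the master holds x(t),
    receives a gradient from worker t mod n, and hands that worker x(t).
    `version` is s such that the received gradient was computed at x(s),
    or None during warm-up when the worker has not yet been given a parameter.
    """
    held = [None] * n_workers          # parameter version each worker is using
    schedule = []
    for t in range(1, n_rounds + 1):
        worker = t % n_workers
        schedule.append((t, worker, held[worker]))
        held[worker] = t               # worker starts computing at x(t)
    return schedule

schedule = cyclic_schedule(n_workers=4, n_rounds=12)
delays = [t - s for (t, _, s) in schedule if s is not None]
```

Under this convention every gradient received after the warm-up rounds is exactly $n$ rounds old,
which is the uniform delay $\tau = n$ of the simplest case.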

Locally Averaged Delayed Architecture. At a high level, the protocol we now describe
combines the delayed updates of the cyclic delayed architecture with averaging techniques of previous
work [NO09, DAW10]. We assume a network $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ is a set of $n$ nodes (workers) and
$\mathcal{E}$ are the edges between the nodes. We select one of the nodes as the master, which maintains the
parameter vector $x(t) \in \mathcal{X}$ over time.

The algorithm works via a series of multicasting and aggregation steps on a spanning tree
rooted at the master node. In the first phase, the algorithm broadcasts from the root towards the
leaves. At step $t$ the master sends its current parameter vector $x(t)$ to its immediate neighbors.
Simultaneously, every other node broadcasts its current parameter vector (which, for a depth $d$
node, is $x(t - d)$) to its children in the spanning tree. See Fig. 2(a). Every worker receives its new
parameter and computes its local gradient at this parameter. The second part of the communication
in a given iteration proceeds from leaves toward the root. The leaf nodes communicate their
gradients to their parents. The parent takes the gradients of the leaf nodes from the previous
round (received at iteration $t - 1$) and averages them with its own gradient, passing this averaged
gradient back up the tree. Again simultaneously, each node takes the averaged gradient vectors of
its children from the previous rounds, averages them with its current gradient vector, and passes
the result up the spanning tree. See Fig. 2(b) and Fig. 3 for a visual description.

Slightly more formally, associated with each node $i \in \mathcal{V}$ is a delay $\tau(i)$, which is (generally) twice
its distance from the master. Fix an iteration $t$. Each node $i \in \mathcal{V}$ has an out of date parameter
vector $x(t - \tau(i)/2)$, which it sends further down the tree to its children. So, for example, the
master node sends the vector $x(t)$ to its children, which send the parameter vector $x(t - 1)$ to their
children, which in turn send $x(t - 2)$ to their children, and so on. Each node computes

$$g_i(t - \tau(i)/2) = \nabla F(x(t - \tau(i)/2); \xi_i(t)),$$

where $\xi_i(t)$ is a random variable sampled at node $i$ from the distribution $P$. The communication
back up the hierarchy proceeds as follows: the leaf nodes in the tree (say at depth $d$) send the
gradient vectors $g_i(t - d)$ to their immediate parents in the tree. At the previous iteration $t - 1$, the
parent nodes received $g_i(t - d - 1)$ from their children, which they average with their own gradients
$g_i(t - d + 1)$ and pass to their parents, and so on. The master node at the root of the tree receives an
average of delayed gradients from the entire tree, with each gradient having a potentially different
delay, giving rise to updates of the form (12) or (13).

Figure 3. Communication of gradient information toward master node at time $t$ from node 1 at
distance $d$ from master. Information stored at time $t$ by node $i$ in brackets to right of node.
[Diagram: node 1 at depth $d$ stores $\{x(t-d), g_2(t-d-2), g_3(t-d-2)\}$ and passes a combination of
$g_1(t-d)$, $g_2(t-d-2)$, and $g_3(t-d-2)$ to its parent; its children, nodes 2 and 3 at depth $d+1$, each
store $\{x(t-d-1)\}$ and send $g_2(t-d-1)$ and $g_3(t-d-1)$.]
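The delays in this architecture are determined by the spanning tree alone. As a sketch, the
following Python snippet computes $\tau(i)$ as twice the depth of node $i$ in a breadth-first
spanning tree rooted at the master, together with the averages of $\tau(i)$ and $\tau(i)^2$. The
7-node network, the adjacency-list representation, and the function name are hypothetical
choices for illustration.

```python
from collections import deque

def tree_delays(adjacency, master):
    """Delay tau(i) = 2 * depth(i) in a BFS spanning tree rooted at the master."""
    depth = {master: 0}
    queue = deque([master])
    while queue:
        node = queue.popleft()
        for nbr in adjacency[node]:
            if nbr not in depth:
                depth[nbr] = depth[node] + 1
                queue.append(nbr)
    return {node: 2 * d for node, d in depth.items()}

# A hypothetical 7-node network: a balanced binary tree with node 0 as master.
adjacency = {0: [1, 2], 1: [0, 3, 4], 2: [0, 5, 6], 3: [1], 4: [1], 5: [2], 6: [2]}
delays = tree_delays(adjacency, master=0)
mean_delay = sum(delays.values()) / len(delays)
mean_sq_delay = sum(d * d for d in delays.values()) / len(delays)
```

For a balanced tree the largest delay grows only logarithmically in the number of nodes,
in contrast with the delay of order $n$ in the cyclic architecture.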

4.1 Convergence rates for delayed distributed minimization 

Having described our architectures, we can now give corollaries to the theoretical results from the
previous sections that show it is possible to achieve asymptotically faster rates (over centralized
procedures) using distributed algorithms even without imposing synchronization requirements. We
allow workers to pipeline updates by computing asynchronously and in parallel, so each worker can
compute a low-variance estimate of the gradient $\nabla f(x)$.

We begin with a simple corollary to the results in Sec. 3.1. We ignore the constants $L$, $G$, $R$,
and $\sigma$, which are not dependent on the characteristics of the network. We also assume that each
worker uses $m$ independent samples of $\xi \sim P$ to compute the stochastic gradient as

$$g_i(t) = \frac{1}{m} \sum_{j=1}^m \nabla F(x(t); \xi_i(j)).$$

Using the cyclic protocol as in Fig. 1, Theorems 1 and 2 give the following result.

Corollary 2. Let $\psi(x) = \frac{1}{2} \|x\|_2^2$, assume the conditions in Corollary 1, and assume that each
worker uses $m$ samples $\xi \sim P$ to compute the gradient it communicates to the master. Then with
the choice $\eta(t) = \sqrt{T}/\sqrt{m}$ either of the updates (9) or (10) satisfies

$$\mathbb{E}[f(\hat{x}(T))] - f(x^*) = O\Big( \frac{B^2}{T} + \frac{1}{\sqrt{Tm}} + \frac{B^2 m}{T} \Big).$$

Proof. The corollary follows straightforwardly from the realization that the variance
$\sigma^2 = \mathbb{E}[\|\nabla f(x) - g_i(t)\|_2^2] = \mathbb{E}[\|\nabla f(x) - \nabla F(x; \xi)\|_2^2]/m = O(1/m)$ when workers use $m$
independent stochastic gradient samples. □

In the above corollary, so long as the bound on the delay $B$ satisfies, say, $B = o(T^{1/4})$, then the last
term in the bound is asymptotically negligible, and we achieve a convergence rate of $O(1/\sqrt{Tm})$.

The cyclic delayed architecture has the drawback that information from a worker can take
$O(n)$ time to reach the master. While the algorithm is robust to delay and does not need lock-step
coordination of workers, the downside of the architecture is that the essentially $n^2 m/T$ term in
the bounds above can be quite large. Indeed, if each worker computes its gradient over $m$ samples
with $m \approx n$ (say to avoid idling of workers) then the cyclic architecture has convergence rate
$O(n^3/T + 1/\sqrt{nT})$. For moderate $T$ or large $n$, the delay penalty $n^3/T$ may dominate $1/\sqrt{nT}$,
offsetting the gains of parallelization.
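To get a feel for when the delay penalty dominates, the following sketch evaluates the two
terms $n^3/T$ and $1/\sqrt{nT}$ with all constants set to one (an assumption made purely for
illustration; the function names are ours). Equating the two terms gives a crossover at $T = n^7$.

```python
import math

def cyclic_rate_terms(n, T):
    """Delay penalty n^3 / T and statistical term 1 / sqrt(n T), constants ignored."""
    return n ** 3 / T, 1.0 / math.sqrt(n * T)

def crossover_iterations(n):
    """Value of T (up to constants) at which n^3 / T equals 1 / sqrt(n T), i.e. T = n^7."""
    return n ** 7

# With n = 10 workers and T = 10^6 iterations the delay penalty is still the larger term.
penalty, statistical = cyclic_rate_terms(n=10, T=10 ** 6)
penalty_at_cross, statistical_at_cross = cyclic_rate_terms(10, crossover_iterations(10))
```

Even for a modest network of ten workers the crossover sits at $10^7$ iterations under these
simplifications, which is the motivation for architectures with smaller delays.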

To address the large $n$ drawback, we turn our attention to the locally averaged architecture
described by Figs. 2 and 3, where delays can be smaller since they depend only on the height
of a spanning tree in the network. The algorithm requires more synchronization than the cyclic
architecture but still performs limited local communication. Each worker computes
$g_i(t - \tau(i)) = \nabla F(x(t - \tau(i)); \xi_i)$, where $\tau(i)$ is the delay of worker $i$ from the master and
$\xi_i \sim P$. As a result of the communication procedure, the master receives a convex combination of
the stochastic gradients evaluated at each worker $i$, for which we gave results in Section 3.2.

In this architecture, the master receives gradients of the form $g_\lambda(t) = \sum_{i=1}^n \lambda_i g_i(t - \tau(i))$ for
some $\lambda$ in the simplex, which puts us in the setting of Theorems 3 and 4. We now make the
reasonable assumption that the gradient errors $\nabla f(x(t)) - g_i(t)$ are uncorrelated across the nodes
in the network.² In statistical applications, for example, each worker may own independent data or
receive streaming data from independent sources; more generally, each worker can simply receive
independent samples $\xi_i \sim P$. We also set $\psi(x) = \frac{1}{2} \|x\|_2^2$, and observe

$$\mathbb{E}\Big\| \sum_{i=1}^n \lambda_i [\nabla f(x(t - \tau(i))) - g_i(t - \tau(i))] \Big\|_2^2 = \sum_{i=1}^n \lambda_i^2 \, \mathbb{E}\big\| \nabla f(x(t - \tau(i))) - g_i(t - \tau(i)) \big\|_2^2.$$

This gives the following corollary to Theorems 3 and 4.

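Corollary 3. Set $\lambda_i = \frac{1}{n}$ for all $i$, $\psi(x) = \frac{1}{2} \|x\|_2^2$, and $\eta(t) = \sigma\sqrt{t + \tau}/(R\sqrt{n})$. Let $\bar{\tau}$ and
$\overline{\tau^2}$ denote the average of the delays $\tau(i)$ and $\tau(i)^2$, respectively. Under the conditions of
Theorem 3 or 4,

$$\mathbb{E}\Big[ \sum_{t=1}^T f(x(t+1)) - T f(x^*) \Big] = O\Big( LR^2 + \bar{\tau} GR + \frac{LG^2 R^2 n \, \overline{\tau^2}}{\sigma^2} \log T + \frac{R\sigma}{\sqrt{n}} \sqrt{T} \Big).$$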



The $\log T$ multiplier can be reduced to a constant if we set $\eta(t) \equiv \sigma\sqrt{T}/(R\sqrt{n})$. By using the averaged
sequence $\hat{x}(T)$ (6), Jensen's inequality gives that asymptotically $\mathbb{E}[f(\hat{x}(T))] - f(x^*) = O(1/\sqrt{Tn})$,
which is an optimal dependence on the number of samples $\xi$ calculated by the method. We also
observe that in this architecture, the delay $\tau$ is bounded by the graph diameter $D$, giving us the bound

\[
\mathbb{E}\bigg[\sum_{t=1}^T f(x(t+1)) - T f(x^*)\bigg]
= O\bigg( L R^2 + D G R + \frac{L G^2 R^2 n D^2}{\sigma^2} \log T + \frac{R\sigma}{\sqrt{n}} \sqrt{T} \bigg). \tag{14}
\]

* Similar results continue to hold under weak correlation.
The above corollaries are general and hold irrespective of the relative costs of communication
and computation. However, with knowledge of the costs, we can adapt the stepsizes slightly to give
better rates of convergence when $n$ is large or communication to the master node is expensive. For
now, we focus on the cyclic architecture (the setting of Corollary 2), though the same principles
apply to the local averaging scheme. Let $C$ denote the cost of communicating between the master
and workers in terms of the time to compute a single gradient sample, and assume that we set
$m = Cn$, so that no worker node has idle time. For simplicity, we let the delay be non-random, so
$B = \tau = n$. Consider the choice $\eta(t) = \eta\sqrt{T/(Cn)}$ for the damping stepsizes, where $\eta \ge 1$. This
setting in Theorem 1 gives

W(T))1 -/(,-)= O + 4^ + 'U O f M + » 



V T ^TCn~ nVTCn~J V T VTCn~ 
where the last equality follows since $\eta \ge 1$. Optimizing for $\eta$ on the right yields

^^{j*^ 1 } and E ^ r ))]-/^) = °( min {^Y} + 7^)- (15) 

The convergence rates thus follow two regimes. When $T < n^7/C^3$, we have convergence rate
$O(n^{2/3}/T^{2/3})$, while once $T > n^7/C^3$, we attain $O(1/\sqrt{TCn})$ convergence. Roughly, in time
proportional to $TC$, we achieve optimization error $1/\sqrt{TCn}$, which is order-optimal given that we
can compute a total of $TCn$ stochastic gradients [ABRW10]. The scaling of this bound is nicer than
the previous one: the dependence on network size is at worst $n^{2/3}$, which we obtain by increasing the
damping factor $\eta(t)$, and hence decreasing the stepsize $\alpha(t) = 1/(L + \eta(t))$, relative to the setting
of Corollary 2. We remark that applying the same technique to Corollary 3 gives convergence rate
scaling as the smaller of $O((D/T)^{2/3} + 1/\sqrt{TCn})$ and $O(nCD/T + 1/\sqrt{TCn})$. Since the diameter
$D \le n$, this is faster than the cyclic architecture's bound (15).



4.2 Running-time comparisons 

Having derived the rates of convergence of the different distributed procedures above, we now
explicitly study the running times of the centralized stochastic gradient algorithms (4) and (5),
the cyclic delayed protocol with the updates (9) and (10), and the locally averaged architecture
with the updates (12) and (13). To make comparisons more cleanly, we avoid constants, assuming
without loss of generality that the variance bound $\sigma^2$ on $\mathbb{E}\|\nabla f(x) - \nabla F(x;\xi)\|^2$ is 1, and that sampling $\xi \sim P$
and evaluating $\nabla F(x;\xi)$ requires one unit of time. Noting that $\mathbb{E}[\nabla F(x;\xi)] = \nabla f(x)$, it is clear
that if we receive $m$ uncorrelated samples of $\xi$, the variance satisfies $\mathbb{E}\big\|\nabla f(x) - \frac{1}{m}\sum_{j=1}^m \nabla F(x;\xi_j)\big\|_2^2 \le \frac{1}{m}$.

Now we state our assumptions on the relative times used by each algorithm. Let $T$ be the
number of units of time allocated to each algorithm, and let the centralized, cyclic delayed and
locally averaged delayed algorithms complete $T_{\rm cent}$, $T_{\rm cycle}$ and $T_{\rm dist}$ iterations, respectively, in time
$T$. It is clear that $T_{\rm cent} = T$. We assume that the distributed methods use $m_{\rm cycle}$ and $m_{\rm dist}$ samples
of $\xi \sim P$ to compute stochastic gradients and that the delay $\tau$ of the cyclic algorithm is $n$. For
concreteness, we assume that communication is of the same order as computing the gradient of
one sample $\nabla F(x;\xi)$ so that $C = 1$. In the cyclic setup of Sec. 3.1, it is reasonable to assume
that $m_{\rm cycle} = \Omega(n)$ to avoid idling of workers (Theorems 1 and 2, as well as the bound (15), show
it is asymptotically beneficial to have $m_{\rm cycle}$ larger, since $\sigma^2_{\rm cycle} = 1/m_{\rm cycle}$). For $m_{\rm cycle} = \Omega(n)$,
the master requires $\frac{m_{\rm cycle}}{n}$ units of time to receive one gradient update, so $\frac{m_{\rm cycle}}{n} T_{\rm cycle} = T$. In
Centralized (4), (5):   $\mathbb{E} f(\hat{x}) - f(x^*) = O\Big(\frac{1}{\sqrt{T}}\Big)$

Cyclic (9), (10):       $\mathbb{E} f(\hat{x}) - f(x^*) = O\Big(\min\Big(\frac{n^{2/3}}{T^{2/3}}, \frac{n^3}{T}\Big) + \frac{1}{\sqrt{Tn}}\Big)$

Local (12), (13):       $\mathbb{E} f(\hat{x}) - f(x^*) = O\Big(\min\Big(\frac{D^{2/3}}{T^{2/3}}, \frac{n\overline{\tau^2}}{T}\Big) + \frac{1}{\sqrt{nT}}\Big)$

Table 1. Upper bounds on $\mathbb{E} f(\hat{x}) - f(x^*)$ for three computational architectures, where $\hat{x}$ is the
output of each algorithm after $T$ units of time. Each algorithm runs for the amount of time it takes
a centralized stochastic algorithm to perform $T$ iterations as in (16). Here $D$ is the diameter of the
network, $n$ is the number of nodes, and $\overline{\tau^2} = \frac{1}{n}\sum_{i=1}^n \tau(i)^2$ is the average squared communication
delay for the local averaging architecture. Bounds for the cyclic architecture assume delay $\tau = n$.



the locally delayed framework, if each node uses $m_{\rm dist}$ samples to compute a gradient, the master
receives a gradient every $m_{\rm dist}$ units of time, and hence $m_{\rm dist} T_{\rm dist} = T$. Further, $\sigma^2_{\rm dist} = 1/m_{\rm dist}$.
We summarize our assumptions by saying that in $T$ units of time, each algorithm performs the
following number of iterations:

\[
T_{\rm cent} = T, \qquad T_{\rm cycle} = \frac{Tn}{m_{\rm cycle}}, \qquad \text{and} \qquad T_{\rm dist} = \frac{T}{m_{\rm dist}}. \tag{16}
\]

Plugging the above iteration counts into the earlier bound (JSJ) and Corollaries [2] and [3] via the 
sharper result (15), we can provide upper bounds (to constant factors) on the expected optimization
accuracy after $T$ units of time for each of the distributed architectures as in Table 1. Asymptotically
in the number of units of time $T$, both the cyclic and locally communicating stochastic optimization
schemes have the same convergence rate. However, topological considerations show that the locally
communicating method (Figs. 2 and 3) has better performance than the cyclic architecture, though
it requires more worker coordination. Since the lower order terms matter only for large $n$ or small
$T$, we compare the terms $n^{2/3}/T^{2/3}$ and $D^{2/3}/T^{2/3}$ for the cyclic and locally averaged algorithms,
respectively. Since $D \le n$ for any network, the locally averaged algorithm always guarantees performance
at least as good as that of the cyclic algorithm. For specific graph topologies, however, we can quantify the
time improvements:

• $n$-node cycle or path: $D = n$, so that both methods have the same convergence rate.

• $\sqrt{n}$-by-$\sqrt{n}$ grid: $D = \sqrt{n}$, so the distributed method has a factor of $n^{2/3}/n^{1/3} = n^{1/3}$ improvement over the cyclic architecture.

• Balanced trees and expander graphs: $D = O(\log n)$, so the distributed method has a factor,
ignoring logarithmic terms, of $n^{2/3}$ improvement over cyclic.
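The comparison in the list above can be sketched numerically. The snippet below is only an illustration: it computes the ratio of the leading terms $n^{2/3}/T^{2/3}$ and $D^{2/3}/T^{2/3}$, which is $(n/D)^{2/3}$, for a hypothetical network size. The function name `improvement_factor`, the value of `n`, and the use of $\log_2 n$ as a stand-in for a diameter of order $\log n$ are assumptions.

```python
import math


def improvement_factor(n, diameter):
    """Ratio of the cyclic leading term n^(2/3)/T^(2/3) to the local term D^(2/3)/T^(2/3)."""
    return (n / diameter) ** (2.0 / 3.0)


n = 4096
topologies = {
    "cycle or path": n,                 # D = n
    "square grid": math.isqrt(n),       # D = sqrt(n)
    "balanced tree": math.log2(n),      # D = O(log n); base 2 chosen for illustration
}
for name, diameter in topologies.items():
    factor = improvement_factor(n, diameter)
    print(f"{name:14s} D = {diameter:7.1f}  factor = {factor:8.2f}")
```

For the grid, the factor equals $n^{1/3}$, matching the second item in the list.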

Naturally, it is possible to modify our assumptions. In a network in which communication
is cheap, or conversely, in a problem for which the computation of $\nabla F(x; \xi)$ is more expensive
than communication, the number of samples $\xi \sim P$ for which each worker computes
gradients is small. Such problems are frequent in statistical machine learning, such as when learning
conditional random field models, which are useful in natural language processing, computational
biology, and other application areas [LMP01]. In this case, it is reasonable to have $m_{\rm cycle} =$
Figure 4. Optimization performance of the delayed cyclic method (9) for the Reuters RCV1 dataset
when we assume that the cost of communication to the master is the same as computing the gradient
of one term in the objective (17). The number of samples $m$ computed is equal to $n$ for each worker.
Plotted is the estimated time to $\epsilon$-accuracy as a function of number of workers $n$ (horizontal axis: number of workers).



$O(1)$, in which case $T_{\rm cycle} = Tn$ and the cyclic delayed architecture has stronger convergence
guarantees of $O(\min\{n^2/T, 1/T^{2/3}\} + 1/\sqrt{Tn})$. In any case, both non-centralized protocols enjoy
significantly faster asymptotic convergence rates for stochastic optimization problems in spite of
asynchronous delays.

5 Numerical Results 

Though this paper focuses mostly on the theoretical analysis of the methods we have presented, it
is important to understand the practical aspects of the above methods in solving real-world tasks
and problems with real data. To that end, we use the cyclic delayed method (9) to solve a common
statistical machine learning problem. Specifically, we focus on solving the logistic regression problem

\[
\min_x \; f(x) = \frac{1}{N} \sum_{i=1}^N \log\big(1 + \exp(-b_i \langle a_i, x \rangle)\big) \quad \text{subject to} \quad \|x\|_2 \le R. \tag{17}
\]

We use the Reuters RCV1 dataset [LYRL04], which consists of $N \approx 800000$ news articles, each
labeled with some combination of the four labels economics, government, commerce, and medicine.
In the above example, the vectors $a_i \in \{0, 1\}^d$, $d \approx 10^5$, are feature vectors representing the words
in each article, and the labels $b_i$ are 1 if the article is about government, $-1$ otherwise.

We simulate the cyclic delayed optimization algorithm (9) for the problem (17) for several
choices of the number of workers $n$ and the number of samples $m$ computed at each worker. We
summarize the results of our experiments in Figure 4. To generate the figure, we fix an $\epsilon$ (in this
case, $\epsilon = .05$), then measure the time it takes the stochastic algorithm (9) to output an $x$ such that
$f(x) \le \inf_{x \in \mathcal{X}} f(x) + \epsilon$. We perform each experiment ten times.
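To make the simulated procedure concrete, here is a minimal Python sketch of a delayed stochastic gradient method on a small logistic regression problem. It is not the experimental code behind the figure: it uses a projected-gradient (Euclidean mirror descent) step with a stale gradient rather than the dual averaging update, a fixed delay rather than a simulated cluster, and synthetic data rather than RCV1. All function names, the data generator, and the parameter values (`radius`, `smoothness`, `eta`) are assumptions chosen for illustration.

```python
import math
import random


def logistic_loss(x, data):
    """Average logistic loss (1/N) * sum_i log(1 + exp(-b_i <a_i, x>))."""
    total = 0.0
    for a, b in data:
        margin = b * sum(ai * xi for ai, xi in zip(a, x))
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))  # stable form
    return total / len(data)


def term_gradient(x, a, b):
    """Gradient of one term log(1 + exp(-b <a, x>)), i.e. -b * a / (1 + exp(margin))."""
    margin = b * sum(ai * xi for ai, xi in zip(a, x))
    if margin >= 0:
        e = math.exp(-margin)
        coeff = -b * e / (1.0 + e)
    else:
        coeff = -b / (1.0 + math.exp(margin))
    return [coeff * ai for ai in a]


def project_l2_ball(x, radius):
    """Euclidean projection onto the constraint set {x : ||x||_2 <= radius}."""
    norm = math.sqrt(sum(v * v for v in x))
    return x if norm <= radius else [v * radius / norm for v in x]


def delayed_sgd(data, delay, iterations, radius=5.0, smoothness=1.0, eta=1.0, seed=0):
    """Projected SGD where the step at time t uses a gradient evaluated at x(t - delay)."""
    rng = random.Random(seed)
    dim = len(data[0][0])
    history = [[0.0] * dim]                      # history[t] stores x(t); x(0) = 0
    for t in range(iterations):
        stale = history[max(t - delay, 0)]       # x(t - tau), clamped at the start
        a, b = data[rng.randrange(len(data))]
        g = term_gradient(stale, a, b)
        alpha = 1.0 / (smoothness + eta * math.sqrt(t + 1 + delay))  # 1 / (L + eta(t))
        x = [xi - alpha * gi for xi, gi in zip(history[-1], g)]
        history.append(project_l2_ball(x, radius))
    # Return the average of the iterates x(1), ..., x(T).
    return [sum(col) / iterations for col in zip(*history[1:])]


def make_data(num_points=200, dim=5, seed=1):
    """Synthetic binary features with labels given by the sign of a random linear score."""
    rng = random.Random(seed)
    truth = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    data = []
    for _ in range(num_points):
        a = [float(rng.random() < 0.5) for _ in range(dim)]
        score = sum(ai * wi for ai, wi in zip(a, truth))
        data.append((a, 1.0 if score >= 0 else -1.0))
    return data


data = make_data()
initial = logistic_loss([0.0] * 5, data)
for delay in (0, 5, 20):
    x_hat = delayed_sgd(data, delay=delay, iterations=2000)
    print(f"delay {delay:2d}: loss {logistic_loss(x_hat, data):.4f} (initial {initial:.4f})")
```

Running the loop for several delays gives a rough feel for how stale gradients affect the averaged iterate on a toy problem; it says nothing quantitative about the RCV1 experiments.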

After computing the number of iterations required to achieve $\epsilon$-accuracy, we convert the results
to running time by assuming it takes one unit of time to compute the gradient of one term in the sum
defining the objective (17). We also assume that it takes 1 unit of time, i.e. $C = 1$, to communicate
from one of the workers to the master, for the master to perform an update, and communicate back
to one of the workers. In an $n$ node system where each worker computes $m$ samples of the gradient,
the master receives an update every $\max\{\frac{m}{n}, 1\}$ time units. A centralized algorithm computing
$m$ samples of its gradient performs an update every $m$ time units. By multiplying the number of
iterations to $\epsilon$-optimality by $\max\{\frac{m}{n}, 1\}$ for the distributed method and by $m$ for the centralized,
we can estimate the amount of time it takes each algorithm to achieve an $\epsilon$-accurate solution.
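The conversion from iteration counts to time can be written out as a short sketch. It takes the per-update cost of the distributed method to be $\max\{m/n, 1\}$ time units and that of the centralized method to be $m$ time units. The function names and the iteration counts in the example are hypothetical and are not measurements from the experiments; the speedup is defined here as centralized time divided by distributed time.

```python
def distributed_time(iterations, m, n):
    """Estimated time for the n-worker cyclic method: one update every max(m/n, 1) units."""
    return iterations * max(m / n, 1.0)


def centralized_time(iterations, m):
    """Estimated time for a centralized method that computes m samples per update."""
    return iterations * m


def speedup(central_iters, dist_iters, m, n):
    """Centralized time divided by distributed time to reach the same accuracy."""
    return centralized_time(central_iters, m) / distributed_time(dist_iters, m, n)


# Hypothetical iteration counts, chosen only to exercise the formulas.
print(distributed_time(iterations=1000, m=10, n=10))
print(centralized_time(iterations=1000, m=10))
print(speedup(central_iters=1000, dist_iters=1200, m=10, n=10))
```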

We now turn to discussing Figure 4. The delayed update (9) enjoys speedup (the ratio of time
to $\epsilon$-accuracy for an $n$-node system versus the centralized procedure) nearly linear in the number $n$
of worker machines until $n \ge 15$ or so. Since we use the stepsize choice $\eta(t) \propto \sqrt{t/n}$, which yields
the predicted convergence rate given by Corollary 2, the $n^2 m/T \approx n^3/T$ term in the convergence
rate presumably becomes non-negligible for larger $n$. This expands on earlier experimental work
with a similar method [LSZ09], which experimentally demonstrated linear speedup for small values
of $n$, but did not investigate larger network sizes. Roughly, as predicted by our theory, for non-asymptotic
regimes the cost of communication and delays due to using $n$ nodes offset some of the
benefits of parallelization. Nevertheless, as our analysis shows, allowing delayed and asynchronous
updates still gives significant performance improvements.



6 Delayed Updates for Smooth Optimization 

In this section, we prove Theorems 1 and 2. We collect in Appendix A a few technical results
relevant to our proof; we will refer to results therein without comment. Before proving either
theorem, we state the lemma that is the key to our argument. Lemma 4 shows that certain
gradient-differencing terms are essentially of second order. As a consequence, when we combine the
results of the lemma with Lemma 7, which bounds $\mathbb{E}[\|x(t) - x(t + \tau)\|^2]$, the gradient differencing
terms become $O(\log T)$ for step size choice $\eta(t) \propto \sqrt{t}$, or $O(1)$ for $\eta(t) \equiv \eta\sqrt{T}$.



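Lemma 4. Let Assumptions A and B on the function $f$ and the compactness assumption C hold.
Then for any sequence $x(t)$,

\[
\sum_{t=1}^T \langle \nabla f(x(t)) - \nabla f(x(t - \tau)), x(t+1) - x^* \rangle
\le \frac{L}{2} \sum_{t=1}^T \|x(t - \tau) - x(t+1)\|^2 + 2\tau G R.
\]

Consequently, if $\mathbb{E}[\|x(t) - x(t+1)\|^2] \le \kappa(t)^2 G^2$ for a non-increasing sequence $\kappa(t)$,

\[
\mathbb{E}\bigg[\sum_{t=1}^T \langle \nabla f(x(t)) - \nabla f(x(t - \tau)), x(t+1) - x^* \rangle\bigg]
\le \frac{L G^2 (\tau + 1)^2}{2} \sum_{t=1}^T \kappa(t - \tau)^2 + 2\tau G R.
\]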

Proof. The proof follows by using a few Bregman divergence identities to rewrite the left hand
side of the above equations, then recognizing that the result is close to a telescoping sum. Recalling
the definition of a Bregman divergence (2), we note the following well-known four term equality, a
consequence of straightforward algebra: for any $a, b, c, d$,

\[
\langle \nabla f(a) - \nabla f(b), c - d \rangle = D_f(d, a) - D_f(d, b) - D_f(c, a) + D_f(c, b). \tag{18}
\]
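The four term equality is easy to confirm numerically. The sketch below checks it at random points for one smooth convex test function (log-sum-exp plus a quadratic); the choice of test function and all helper names are assumptions made only for this check.

```python
import math
import random


def f(x):
    """A smooth convex test function: log-sum-exp plus a quadratic."""
    return math.log(sum(math.exp(v) for v in x)) + 0.5 * sum(v * v for v in x)


def grad_f(x):
    z = sum(math.exp(v) for v in x)
    return [math.exp(v) / z + v for v in x]


def inner(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def bregman(x, y):
    """D_f(x, y) = f(x) - f(y) - <grad f(y), x - y>."""
    return f(x) - f(y) - inner(grad_f(y), [xi - yi for xi, yi in zip(x, y)])


def four_term_gap(a, b, c, d):
    """Left side minus right side of the four term equality; zero up to rounding."""
    lhs = inner([ga - gb for ga, gb in zip(grad_f(a), grad_f(b))],
                [ci - di for ci, di in zip(c, d)])
    rhs = bregman(d, a) - bregman(d, b) - bregman(c, a) + bregman(c, b)
    return lhs - rhs


rng = random.Random(0)
points = [[rng.uniform(-2.0, 2.0) for _ in range(3)] for _ in range(4)]
print(abs(four_term_gap(*points)))
```

The printed gap should be at the level of floating-point rounding error, since the identity holds exactly for any differentiable $f$.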
Using the equality (18), we see that

\[
\begin{aligned}
&\langle \nabla f(x(t)) - \nabla f(x(t - \tau)), x(t+1) - x^* \rangle \\
&\qquad = D_f(x^*, x(t)) - D_f(x^*, x(t - \tau)) - D_f(x(t+1), x(t)) + D_f(x(t+1), x(t - \tau)).
\end{aligned} \tag{19}
\]

To make (19) useful, we note that the Lipschitz continuity of $\nabla f$ implies

\[
f(x(t+1)) \le f(x(t - \tau)) + \langle \nabla f(x(t - \tau)), x(t+1) - x(t - \tau) \rangle + \frac{L}{2} \|x(t - \tau) - x(t+1)\|^2,
\]

so that recalling the definition of $D_f$ (2) we have

\[
D_f(x(t+1), x(t - \tau)) \le \frac{L}{2} \|x(t - \tau) - x(t+1)\|^2.
\]

In particular, using the non-negativity of $D_f(x, y)$, we can replace (19) with the bound

\[
\langle \nabla f(x(t)) - \nabla f(x(t - \tau)), x(t+1) - x^* \rangle \le D_f(x^*, x(t)) - D_f(x^*, x(t - \tau)) + \frac{L}{2} \|x(t - \tau) - x(t+1)\|^2.
\]

Summing the inequality, we see that

\[
\sum_{t=1}^T \langle \nabla f(x(t)) - \nabla f(x(t - \tau)), x(t+1) - x^* \rangle \le \sum_{t=T-\tau+1}^T D_f(x^*, x(t)) + \frac{L}{2} \sum_{t=1}^T \|x(t - \tau) - x(t+1)\|^2. \tag{20}
\]

To bound the first Bregman divergence term, we recall that by Assumption C and the strong
convexity of $\psi$, $\|x^* - x(t)\|^2 \le 2 D_\psi(x^*, x(t)) \le 2R^2$, and hence the optimality of $x^*$ implies

\[
D_f(x^*, x(t)) = f(x^*) - f(x(t)) - \langle \nabla f(x(t)), x^* - x(t) \rangle \le \|\nabla f(x(t))\|_* \|x^* - x(t)\| \le 2GR.
\]

This gives the first bound of the lemma. For the second bound, using convexity, we see that

\[
\|x(t - \tau) - x(t+1)\|^2 \le (\tau + 1)^2 \sum_{s=0}^{\tau} \frac{1}{\tau + 1} \|x(t - s) - x(t - s + 1)\|^2,
\]

so by taking expectations we have $\mathbb{E}[\|x(t - \tau) - x(t+1)\|^2] \le (\tau + 1)^2 \kappa(t - \tau)^2 G^2$. Since $\kappa$ is non-increasing
(by the definition of the update scheme) we see that the sum (20) is further bounded by
$2\tau G R + \frac{L}{2} \sum_{t=1}^T G^2 (\tau + 1)^2 \kappa(t - \tau)^2$ as desired. □
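The convexity step in the proof above, which bounds the squared norm of a sum of $\tau + 1$ consecutive differences by $(\tau + 1)$ times the sum of their squared norms, can be checked with a few lines of Python. This is only an illustrative sketch with random vectors standing in for the differences $x(t-s) - x(t-s+1)$; the helper names are assumptions.

```python
import random


def squared_norm(v):
    return sum(x * x for x in v)


def telescoped_vs_bound(steps):
    """Return (||sum of steps||^2, (tau + 1) * sum of ||step||^2) for tau + 1 steps."""
    total = [sum(col) for col in zip(*steps)]
    return squared_norm(total), len(steps) * sum(squared_norm(s) for s in steps)


rng = random.Random(0)
tau = 4
steps = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(tau + 1)]
lhs, rhs = telescoped_vs_bound(steps)
print(lhs <= rhs)
```

Equality is attained when all the steps coincide, which is why the factor $(\tau + 1)^2$ in the lemma cannot be improved by this argument alone.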



6.1 Proof of Theorem 1

The essential idea in this proof is to use convexity and smoothness to bound $f(x(t)) - f(x^*)$, then
use the sequence $\eta(t)$, which decreases the stepsize $\alpha(t)$, to cancel variance terms. To begin, we
define the error $e(t)$ as

\[
e(t) := \nabla f(x(t)) - g(t - \tau),
\]

where $g(t - \tau) = \nabla F(x(t - \tau); \xi(t))$ for some $\xi(t) \sim P$. Note that $e(t)$ does not have zero expectation,
as there is a time delay.
By using the convexity of $f$ and then the $L$-Lipschitz continuity of $\nabla f$, for any $x^* \in \mathcal{X}$, we have

\[
\begin{aligned}
f(x(t)) - f(x^*) &\le \langle \nabla f(x(t)), x(t) - x^* \rangle = \langle \nabla f(x(t)), x(t+1) - x^* \rangle + \langle \nabla f(x(t)), x(t) - x(t+1) \rangle \\
&\le \langle \nabla f(x(t)), x(t+1) - x^* \rangle + f(x(t)) - f(x(t+1)) + \frac{L}{2} \|x(t) - x(t+1)\|^2,
\end{aligned}
\]

so that

\[
\begin{aligned}
f(x(t+1)) - f(x^*) &\le \langle \nabla f(x(t)), x(t+1) - x^* \rangle + \frac{L}{2} \|x(t) - x(t+1)\|^2 \\
&= \langle g(t - \tau), x(t+1) - x^* \rangle + \langle e(t), x(t+1) - x^* \rangle + \frac{L}{2} \|x(t) - x(t+1)\|^2 \\
&= \langle z(t+1), x(t+1) - x^* \rangle - \langle z(t), x(t+1) - x^* \rangle + \langle e(t), x(t+1) - x^* \rangle + \frac{L}{2} \|x(t) - x(t+1)\|^2.
\end{aligned}
\]

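Now, by applying Lemma 5 in Appendix A and the definition of the update, we see that

\[
- \langle z(t), x(t+1) - x^* \rangle \le - \langle z(t), x(t) - x^* \rangle + \frac{1}{\alpha(t)} \big[\psi(x(t+1)) - \psi(x(t))\big] - \frac{1}{\alpha(t)} D_\psi(x(t+1), x(t)),
\]

which implies

\[
\begin{aligned}
f(x(t+1)) - f(x^*)
&\le \langle z(t+1), x(t+1) - x^* \rangle - \langle z(t), x(t) - x^* \rangle + \frac{1}{\alpha(t)} \big[\psi(x(t+1)) - \psi(x(t))\big] \\
&\qquad - L D_\psi(x(t+1), x(t)) - \eta(t) D_\psi(x(t+1), x(t)) + \frac{L}{2} \|x(t) - x(t+1)\|^2 + \langle e(t), x(t+1) - x^* \rangle \\
&\le \langle z(t+1), x(t+1) - x^* \rangle - \langle z(t), x(t) - x^* \rangle + \frac{1}{\alpha(t)} \big[\psi(x(t+1)) - \psi(x(t))\big] \\
&\qquad - \eta(t) D_\psi(x(t+1), x(t)) + \langle e(t), x(t+1) - x^* \rangle.
\end{aligned} \tag{21}
\]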



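To get the bound (21), we substituted $\alpha(t)^{-1} = L + \eta(t)$ and then used the fact that $\psi$ is strongly
convex, so $D_\psi(x(t+1), x(t)) \ge \frac{1}{2} \|x(t) - x(t+1)\|^2$. By summing the bound (21), we have the
following non-probabilistic inequality:

\[
\begin{aligned}
\sum_{t=1}^T f(x(t+1)) - f(x^*)
&\le \langle z(T+1), x(T+1) - x^* \rangle + \frac{1}{\alpha(T)} \psi(x(T+1)) + \sum_{t=1}^T \psi(x(t)) \bigg[\frac{1}{\alpha(t-1)} - \frac{1}{\alpha(t)}\bigg] \\
&\qquad - \sum_{t=1}^T \eta(t) D_\psi(x(t+1), x(t)) + \sum_{t=1}^T \langle e(t), x(t+1) - x^* \rangle \\
&\le \frac{1}{\alpha(T+1)} \psi(x^*) + \sum_{t=1}^T \psi(x(t)) \bigg[\frac{1}{\alpha(t-1)} - \frac{1}{\alpha(t)}\bigg] \\
&\qquad - \sum_{t=1}^T \eta(t) D_\psi(x(t+1), x(t)) + \sum_{t=1}^T \langle e(t), x(t+1) - x^* \rangle
\end{aligned} \tag{22}
\]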
since $\psi(x) \ge 0$ and $x(T+1)$ minimizes $\langle z(T+1), x \rangle + \frac{1}{\alpha(T+1)} \psi(x)$. What remains is to control the
summed $e(t)$ terms in the bound (22). We can do this simply using the second part of Lemma 4.
Indeed, we have

\[
\begin{aligned}
\sum_{t=1}^T \langle e(t), x(t+1) - x^* \rangle
&= \sum_{t=1}^T \langle \nabla f(x(t)) - \nabla f(x(t - \tau)), x(t+1) - x^* \rangle \\
&\qquad + \sum_{t=1}^T \langle \nabla f(x(t - \tau)) - g(t - \tau), x(t+1) - x^* \rangle.
\end{aligned} \tag{23}
\]

We can apply Lemma 4 to the first term in (23) by bounding $\|x(t) - x(t+1)\|$ with Lemma 7.
Since $\eta(t) \propto \sqrt{t + \tau}$, Lemma 7 with $t_0 = \tau$ implies $\mathbb{E}[\|x(t) - x(t+1)\|^2] \le \frac{4G^2}{\eta(t)^2}$. As a consequence,

\[
\mathbb{E}\bigg[\sum_{t=1}^T \langle \nabla f(x(t)) - \nabla f(x(t - \tau)), x(t+1) - x^* \rangle\bigg] \le 2\tau G R + 2L(\tau + 1)^2 G^2 \sum_{t=1}^T \frac{1}{\eta(t - \tau)^2}.
\]



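What remains, then, is to bound the stochastic (second) term in (23). This is straightforward,
though:

\[
\begin{aligned}
&\langle \nabla f(x(t - \tau)) - g(t - \tau), x(t+1) - x^* \rangle \\
&\quad = \langle \nabla f(x(t - \tau)) - g(t - \tau), x(t) - x^* \rangle + \langle \nabla f(x(t - \tau)) - g(t - \tau), x(t+1) - x(t) \rangle \\
&\quad \le \langle \nabla f(x(t - \tau)) - g(t - \tau), x(t) - x^* \rangle + \frac{1}{2\eta(t)} \|\nabla f(x(t - \tau)) - g(t - \tau)\|_*^2 + \frac{\eta(t)}{2} \|x(t+1) - x(t)\|^2
\end{aligned}
\]

by the Fenchel-Young inequality applied to the conjugate pair $\frac{1}{2}\|\cdot\|^2$ and $\frac{1}{2}\|\cdot\|_*^2$. In addition,
$\nabla f(x(t - \tau)) - g(t - \tau)$ is independent of $x(t)$ given the sigma-field containing $g(1), \ldots, g(t - \tau - 1)$,
since $x(t)$ is a function of gradients to time $t - \tau - 1$, so the first term has zero expectation. Also
recall that $\mathbb{E}[\|\nabla f(x(t - \tau)) - g(t - \tau)\|_*^2]$ is bounded by $\sigma^2$ by assumption. Combining the above
two bounds into (23), we see that

\[
\sum_{t=1}^T \mathbb{E}[\langle e(t), x(t+1) - x^* \rangle]
\le \frac{\sigma^2}{2} \sum_{t=1}^T \frac{1}{\eta(t)} + \sum_{t=1}^T \frac{\eta(t)}{2} \mathbb{E}\|x(t+1) - x(t)\|^2 + 2LG^2(\tau + 1)^2 \sum_{t=1}^T \frac{1}{\eta(t - \tau)^2} + 2\tau G R. \tag{24}
\]

Since $D_\psi(x(t+1), x(t)) \ge \frac{1}{2} \|x(t) - x(t+1)\|^2$, combining (24) with (22) and noting that
$\frac{1}{\alpha(t-1)} - \frac{1}{\alpha(t)} \le 0$ gives

\[
\mathbb{E}\bigg[\sum_{t=1}^T f(x(t+1)) - f(x^*)\bigg] \le \frac{1}{\alpha(T+1)} \psi(x^*) + \frac{\sigma^2}{2} \sum_{t=1}^T \frac{1}{\eta(t)} + 2LG^2(\tau + 1)^2 \sum_{t=1}^T \frac{1}{\eta(t - \tau)^2} + 2\tau G R.
\]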



6.2 Proof of Theorem 2

The proof of Theorem 2 is similar to that of Theorem 1, so we will be somewhat terse. We define
the error $e(t) = \nabla f(x(t)) - g(t - \tau)$, identically as in the earlier proof, and begin as we did in the
proof of Theorem 1. Recall that

\[
f(x(t+1)) - f(x^*) \le \langle g(t - \tau), x(t+1) - x^* \rangle + \langle e(t), x(t+1) - x^* \rangle + \frac{L}{2} \|x(t) - x(t+1)\|^2. \tag{25}
\]

Applying the first-order optimality condition to the definition of $x(t+1)$ (10), we get

\[
\langle \alpha(t) g(t - \tau) + \nabla\psi(x(t+1)) - \nabla\psi(x(t)), x - x(t+1) \rangle \ge 0
\]

for all $x \in \mathcal{X}$. In particular, we have

\[
\begin{aligned}
\alpha(t) \langle g(t - \tau), x(t+1) - x^* \rangle &\le \langle \nabla\psi(x(t+1)) - \nabla\psi(x(t)), x^* - x(t+1) \rangle \\
&= D_\psi(x^*, x(t)) - D_\psi(x^*, x(t+1)) - D_\psi(x(t+1), x(t)).
\end{aligned}
\]

Applying the above to the inequality (25), we see that

\[
\begin{aligned}
f(x(t+1)) - f(x^*)
&\le \frac{1}{\alpha(t)} \big[D_\psi(x^*, x(t)) - D_\psi(x^*, x(t+1)) - D_\psi(x(t+1), x(t))\big] + \langle e(t), x(t+1) - x^* \rangle + \frac{L}{2} \|x(t) - x(t+1)\|^2 \\
&\le \frac{1}{\alpha(t)} \big[D_\psi(x^*, x(t)) - D_\psi(x^*, x(t+1))\big] + \langle e(t), x(t+1) - x^* \rangle - \eta(t) D_\psi(x(t+1), x(t)),
\end{aligned} \tag{26}
\]



where for the last inequality, we use the fact that $D_\psi(x(t+1), x(t)) \ge \frac{1}{2} \|x(t) - x(t+1)\|^2$, by the
strong convexity of $\psi$, and that $\alpha(t)^{-1} = L + \eta(t)$. By summing the inequality (26), we have

\[
\begin{aligned}
\sum_{t=1}^T f(x(t+1)) - f(x^*)
&\le \frac{1}{\alpha(1)} D_\psi(x^*, x(1)) + \sum_{t=2}^T D_\psi(x^*, x(t)) \bigg[\frac{1}{\alpha(t)} - \frac{1}{\alpha(t-1)}\bigg] \\
&\qquad - \sum_{t=1}^T \eta(t) D_\psi(x(t+1), x(t)) + \sum_{t=1}^T \langle e(t), x(t+1) - x^* \rangle.
\end{aligned} \tag{27}
\]

Comparing the bound (27) with the earlier bound for the dual averaging algorithms (22), we see
that the only essential difference is the $\alpha(t)^{-1} - \alpha(t-1)^{-1}$ terms. The compactness assumption
guarantees that $D_\psi(x^*, x(t)) \le R^2$, however, so

\[
\sum_{t=2}^T D_\psi(x^*, x(t)) \bigg[\frac{1}{\alpha(t)} - \frac{1}{\alpha(t-1)}\bigg] \le \frac{R^2}{\alpha(T)}.
\]

The remainder of the proof uses Lemmas 7 and 4 completely identically to the proof of Theorem 1.



6.3 Proof of Corollary 1

We prove this result only for the mirror descent algorithm (10), as the proof for the dual-averaging-based
algorithm (9) is similar. We define the error at time $t$ to be $e(t) = \nabla f(x(t)) - g(t - \tau(t))$, and
observe that we only need to control the second term involving $e(t)$ in the bound (26) differently.
Expanding the error terms above and using Fenchel's inequality as in the proofs of Theorems 1
and 2, we have

\[
\begin{aligned}
\langle e(t), x(t+1) - x^* \rangle
&\le \langle \nabla f(x(t)) - \nabla f(x(t - \tau(t))), x(t+1) - x^* \rangle + \langle \nabla f(x(t - \tau(t))) - g(t - \tau(t)), x(t) - x^* \rangle \\
&\qquad + \frac{1}{2\eta(t)} \|\nabla f(x(t - \tau(t))) - g(t - \tau(t))\|_*^2 + \frac{\eta(t)}{2} \|x(t+1) - x(t)\|^2.
\end{aligned}
\]

Now we note that conditioned on the delay $\tau(t)$, we have

\[
\mathbb{E}[\|x(t - \tau(t)) - x(t+1)\|^2 \mid \tau(t)] \le G^2 (\tau(t) + 1)^2 \alpha(t - \tau(t))^2.
\]

Consequently we apply Lemma 4 (specifically, following the bounds (19) and (20)) and find



T 

£ <V/(x(t)) - V/(x(t - r(t))), x(i + 1) - x*) 
t=i 

< J2 [Df{x\x{t)) - D f {x\x(t - r{t)))\ + G 2 f>(t) + l?a{t - r{t)f . 



t=l t=i 



The sum of Df terms telescopes, leaving only terms not received by the gradient procedure within 
T iterations, and we can use a(t) < —^= for all t to derive the further bound 

^ D f (x\x(t)) + ^Y(r(t) + l) 2 . (28) 

t:t+r(t)>T ^ t=l 

To control the quantity (28), all we need is to bound the expected cardinality of the set $\{t \in [T] : t + \tau(t) > T\}$.
Using Chebyshev's inequality and standard expectation bounds, we have

\[
\mathbb{E}\big[\mathrm{card}(\{t \in [T] : t + \tau(t) > T\})\big] = \sum_{t=1}^T \mathbb{P}(t + \tau(t) > T) \le 1 + \sum_{t=1}^{T-1} \frac{\mathbb{E}[\tau(t)^2]}{(T - t)^2} \le 1 + 2B^2,
\]

where the last inequality comes from our assumption that $\mathbb{E}[\tau(t)^2] \le B^2$. As in Lemma 4, we have
$D_f(x^*, x(t)) \le 2GR$, which yields



E 



£ (V/(x(i)) - V/(x(t - r(t))), x(i + 1) - x* 



7/ 2 



We can control the remaining terms as in the proofs of Theorems 1 and 2.

7 Proof of Theorem 3

The proof of Theorem 3 is not too difficult given our previous work: all we need to do is redefine
the error $e(t)$ and use $\eta(t)$ to control the variance terms that arise. To that end, we define the
gradient error terms that we must control. In this proof, we set

\[
e(t) := \nabla f(x(t)) - \sum_{i=1}^n \lambda_i g_i(t - \tau(i)), \tag{29}
\]

where $g_i(t) = \nabla F(x(t); \xi_i(t))$ is the gradient of node $i$ computed at the parameter $x(t)$ and $\tau(i)$ is
the delay associated with node $i$.

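Using Assumption B as in the proofs of previous theorems, then applying Lemma 5, we have

\[
\begin{aligned}
f(x(t+1)) - f(x^*) &\le \langle \nabla f(x(t)), x(t+1) - x^* \rangle + \frac{L}{2} \|x(t) - x(t+1)\|^2 \\
&= \Big\langle \sum_{i=1}^n \lambda_i g_i(t - \tau(i)), x(t+1) - x^* \Big\rangle + \langle e(t), x(t+1) - x^* \rangle + \frac{L}{2} \|x(t) - x(t+1)\|^2 \\
&= \langle z(t+1), x(t+1) - x^* \rangle - \langle z(t), x(t+1) - x^* \rangle + \langle e(t), x(t+1) - x^* \rangle + \frac{L}{2} \|x(t) - x(t+1)\|^2 \\
&\le \langle z(t+1), x(t+1) - x^* \rangle - \langle z(t), x(t) - x^* \rangle + \frac{1}{\alpha(t)} \psi(x(t+1)) - \frac{1}{\alpha(t)} \psi(x(t)) \\
&\qquad - \frac{1}{\alpha(t)} D_\psi(x(t+1), x(t)) + \langle e(t), x(t+1) - x^* \rangle + \frac{L}{2} \|x(t) - x(t+1)\|^2.
\end{aligned}
\]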

We telescope as in the proofs of Theorems 1 and 2, canceling $\frac{L}{2} \|x(t) - x(t+1)\|^2$ with the $L D_\psi$
divergence terms, to see that

\[
\begin{aligned}
\sum_{t=1}^T f(x(t+1)) - f(x^*)
&\le \langle z(T+1), x(T+1) - x^* \rangle + \frac{1}{\alpha(T)} \psi(x(T+1)) - \sum_{t=1}^T \eta(t) D_\psi(x(t+1), x(t)) + \sum_{t=1}^T \langle e(t), x(t+1) - x^* \rangle \\
&\le \frac{1}{\alpha(T+1)} \psi(x^*) - \sum_{t=1}^T \eta(t) D_\psi(x(t+1), x(t)) + \sum_{t=1}^T \langle e(t), x(t+1) - x^* \rangle.
\end{aligned} \tag{30}
\]

This is exactly as in the non-probabilistic bound (22) from the proof of Theorem 1, but the definition (29)
of the error $e(t)$ here is different.

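What remains is to control the error term in (30). Writing the terms out, we have

\[
\begin{aligned}
\sum_{t=1}^T \langle e(t), x(t+1) - x^* \rangle
&= \sum_{t=1}^T \Big\langle \nabla f(x(t)) - \sum_{i=1}^n \lambda_i \nabla f(x(t - \tau(i))), x(t+1) - x^* \Big\rangle \\
&\qquad + \sum_{t=1}^T \Big\langle \sum_{i=1}^n \lambda_i \big[\nabla f(x(t - \tau(i))) - g_i(t - \tau(i))\big], x(t+1) - x^* \Big\rangle.
\end{aligned} \tag{31}
\]

Bounding the first term above is simple via Lemma 4: as in the proof of Theorem 1 earlier, we have

\[
\begin{aligned}
&\mathbb{E}\bigg[\sum_{t=1}^T \Big\langle \nabla f(x(t)) - \sum_{i=1}^n \lambda_i \nabla f(x(t - \tau(i))), x(t+1) - x^* \Big\rangle\bigg] \\
&\qquad = \sum_{i=1}^n \lambda_i \sum_{t=1}^T \mathbb{E}\big[\langle \nabla f(x(t)) - \nabla f(x(t - \tau(i))), x(t+1) - x^* \rangle\big] \\
&\qquad \le 2 \sum_{i=1}^n \lambda_i L G^2 (\tau(i) + 1)^2 \sum_{t=1}^T \frac{1}{\eta(t - \tau(i))^2} + 2 \sum_{i=1}^n \lambda_i \tau(i) G R.
\end{aligned}
\]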
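We use the same technique as the proof of Theorem 1 to bound the second term from (31).
Indeed, the Fenchel-Young inequality gives

\[
\begin{aligned}
&\Big\langle \sum_{i=1}^n \lambda_i \big[\nabla f(x(t - \tau(i))) - g_i(t - \tau(i))\big], x(t+1) - x^* \Big\rangle \\
&\quad = \Big\langle \sum_{i=1}^n \lambda_i \big[\nabla f(x(t - \tau(i))) - g_i(t - \tau(i))\big], x(t) - x^* \Big\rangle
 + \Big\langle \sum_{i=1}^n \lambda_i \big[\nabla f(x(t - \tau(i))) - g_i(t - \tau(i))\big], x(t+1) - x(t) \Big\rangle \\
&\quad \le \Big\langle \sum_{i=1}^n \lambda_i \big[\nabla f(x(t - \tau(i))) - g_i(t - \tau(i))\big], x(t) - x^* \Big\rangle
 + \frac{1}{2\eta(t)} \bigg\| \sum_{i=1}^n \lambda_i \big[\nabla f(x(t - \tau(i))) - g_i(t - \tau(i))\big] \bigg\|_*^2 + \frac{\eta(t)}{2} \|x(t+1) - x(t)\|^2.
\end{aligned}
\]

By assumption, given the information at worker $i$ at time $t - \tau(i)$, $g_i(t - \tau(i))$ is independent of
$x(t)$, so the first term has zero expectation. More formally, this happens because $x(t)$ is a function
of gradients $g_i(1), \ldots, g_i(t - \tau(i) - 1)$ from each of the nodes $i$ and hence the expectation of the first
term conditioned on $\{g_i(1), \ldots, g_i(t - \tau(i) - 1)\}_{i=1}^n$ is 0. The last term is canceled by the Bregman
divergence terms in (30), so combining the bound (31) with the above two paragraphs yields

\[
\begin{aligned}
\mathbb{E}\bigg[\sum_{t=1}^T f(x(t+1)) - f(x^*)\bigg]
&\le \frac{1}{\alpha(T+1)} \psi(x^*) + 2 \sum_{i=1}^n \lambda_i L G^2 (\tau(i) + 1)^2 \sum_{t=1}^T \frac{1}{\eta(t - \tau(i))^2} + 2 \sum_{i=1}^n \lambda_i \tau(i) G R \\
&\qquad + \sum_{t=1}^T \frac{1}{2\eta(t)} \mathbb{E}\bigg\| \sum_{i=1}^n \lambda_i \big[\nabla f(x(t - \tau(i))) - g_i(t - \tau(i))\big] \bigg\|_*^2.
\end{aligned}
\]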

8 Conclusion and Discussion 

In this paper, we have studied dual averaging and mirror descent algorithms for smooth and non-smooth
stochastic optimization in delayed settings, showing applications of our results to distributed
optimization. We showed that for smooth problems, we can preserve the performance benefits of
parallelization over centralized stochastic optimization even when we relax synchronization requirements.
Specifically, we presented methods that take advantage of distributed computational
resources and are robust to node failures, communication latency, and node slowdowns. In addition,
by distributing computation for stochastic optimization problems, we were able to exploit
asynchronous processing without incurring any asymptotic penalty due to the delays incurred.
Finally, though we omit these results for brevity, it is possible to extend all of our expected
convergence results to high-probability guarantees.
A Technical Results about Proximal Functions

In this section, we collect several useful results about proximal functions and continuity properties
of the solutions of proximal operators. We give proofs of all uncited results in Appendix B. We
begin with results useful for the dual-averaging updates (4) and (9).
We define the proximal dual function

\[
\psi^*_\alpha(z) := \sup_{x \in \mathcal{X}} \Big\{ \langle -z, x \rangle - \frac{1}{\alpha} \psi(x) \Big\}. \tag{32}
\]

Since $\nabla\psi^*_\alpha(z) = \operatorname{argmax}_{x \in \mathcal{X}} \{\langle -z, x \rangle - \alpha^{-1} \psi(x)\}$, it is clear that $x(t) = \nabla\psi^*_{\alpha(t)}(z(t))$. Further, by
strong convexity of $\psi$, we have that $\nabla\psi^*_\alpha(z)$ is $\alpha$-Lipschitz continuous [Nes09, HUL96b, Chapter
X], that is, for the norm $\|\cdot\|$ with respect to which $\psi$ is strongly convex and its associated dual
norm $\|\cdot\|_*$,

\[
\|\nabla\psi^*_\alpha(y) - \nabla\psi^*_\alpha(z)\| \le \alpha \|y - z\|_*. \tag{33}
\]

We will find one more result about solutions to the dual averaging update useful. This result has
essentially been proven in many contexts [Nes09, Tse08, DGBSX10a].

Lemma 5. Let $x^+$ minimize $\langle z, x \rangle + A\psi(x)$ over $x \in \mathcal{X}$. Then for any $x \in \mathcal{X}$,

\[
\langle z, x \rangle + A\psi(x) \ge \langle z, x^+ \rangle + A\psi(x^+) + A D_\psi(x, x^+).
\]

Now we turn to describing properties of the mirror-descent step (5), which we will also use
frequently. The lemma allows us to bound differences between $x(t)$ and $x(t+1)$ for the mirror-descent
family of algorithms.

Lemma 6. Let $x^+$ minimize $\langle g, x \rangle + \frac{1}{\alpha} D_\psi(x, y)$ over $x \in \mathcal{X}$. Then $\|x^+ - y\| \le \alpha \|g\|_*$.

The last technical lemma we give explicitly bounds the differences between $x(t)$ and $x(t + \tau)$,
for some $\tau \ge 1$, by using the above continuity lemmas.

Lemma 7. Let Assumption A hold. Define $x(t)$ via the dual-averaging updates (4), (9), or (12)
or the mirror-descent updates (5), (10), or (13). Let $\alpha(t)^{-1} = L + \eta(t + t_0)^c$ for some $c \in [0, 1]$,
$\eta > 0$, $t_0 \ge 0$, and $L \ge 0$. Then for any fixed $\tau$,

\[
\mathbb{E}[\|x(t) - x(t + \tau)\|^2] \le \frac{4G^2\tau^2}{\eta^2 (t + t_0)^{2c}} \qquad \text{and} \qquad \mathbb{E}[\|x(t) - x(t + \tau)\|] \le \frac{2G\tau}{\eta (t + t_0)^c}.
\]



B Proofs of Proximal Operator Properties 

Proof of Lemma 6. The inequality is clear when $x^+ = y$, so assume that $x^+ \ne y$. Since $x^+$
minimizes $\langle g, x \rangle + \frac{1}{\alpha} D_\psi(x, y)$, the first order conditions for optimality imply

\[
\langle \alpha g + \nabla\psi(x^+) - \nabla\psi(y), x - x^+ \rangle \ge 0
\]

for any $x \in \mathcal{X}$. Thus we can choose $x = y$ and see that

\[
\alpha \langle g, y - x^+ \rangle \ge \langle \nabla\psi(x^+) - \nabla\psi(y), x^+ - y \rangle \ge \|x^+ - y\|^2,
\]

where the last inequality follows from the strong convexity of $\psi$. Using Hölder's inequality gives
that $\alpha \|g\|_* \|y - x^+\| \ge \|x^+ - y\|^2$, and dividing by $\|y - x^+\|$ completes the proof. □
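For intuition, Lemma 6 can be checked numerically in the Euclidean case $\psi(x) = \frac{1}{2}\|x\|_2^2$, where the minimizer of $\langle g, x \rangle + \frac{1}{\alpha} D_\psi(x, y)$ over an $\ell_2$ ball is the projection of $y - \alpha g$ onto that ball. The following sketch is an illustration under these assumptions only (the ball constraint, the helper names, and the random test points are not from the paper); it records the largest observed ratio $\|x^+ - y\| / (\alpha \|g\|_2)$.

```python
import math
import random


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def project_ball(v, radius):
    """Euclidean projection onto {x : ||x||_2 <= radius}."""
    size = norm(v)
    return list(v) if size <= radius else [x * radius / size for x in v]


def mirror_step(y, g, alpha, radius):
    """For psi(x) = 0.5 * ||x||_2^2: minimizer of <g, x> + D_psi(x, y) / alpha over the ball."""
    return project_ball([yi - alpha * gi for yi, gi in zip(y, g)], radius)


rng = random.Random(0)
radius, alpha = 1.0, 0.3
worst_ratio = 0.0
for _ in range(1000):
    y = project_ball([rng.uniform(-1.0, 1.0) for _ in range(3)], radius)
    g = [rng.gauss(0.0, 2.0) for _ in range(3)]
    step = norm([a - b for a, b in zip(mirror_step(y, g, alpha, radius), y)])
    worst_ratio = max(worst_ratio, step / (alpha * norm(g)))
print(worst_ratio <= 1.0 + 1e-12)
```

The ratio reaches 1 (up to rounding) whenever the projection is inactive, so the lemma is tight in this setting.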



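Proof of Lemma 7. We first show the lemma for the dual-averaging updates. Recall that
$x(t) = \nabla\psi^*_{\alpha(t)}(z(t))$ and $\nabla\psi^*_\alpha$ is $\alpha$-Lipschitz continuous. Using the triangle inequality,

\[
\begin{aligned}
\|x(t) - x(t + \tau)\| &= \big\|\nabla\psi^*_{\alpha(t)}(z(t)) - \nabla\psi^*_{\alpha(t+\tau)}(z(t + \tau))\big\| \\
&= \big\|\nabla\psi^*_{\alpha(t)}(z(t)) - \nabla\psi^*_{\alpha(t+\tau)}(z(t)) + \nabla\psi^*_{\alpha(t+\tau)}(z(t)) - \nabla\psi^*_{\alpha(t+\tau)}(z(t + \tau))\big\| \\
&\le \big\|\nabla\psi^*_{\alpha(t)}(z(t)) - \nabla\psi^*_{\alpha(t+\tau)}(z(t))\big\| + \big\|\nabla\psi^*_{\alpha(t+\tau)}(z(t)) - \nabla\psi^*_{\alpha(t+\tau)}(z(t + \tau))\big\| \\
&\le (\alpha(t) - \alpha(t + \tau)) \|z(t)\|_* + \alpha(t + \tau) \|z(t) - z(t + \tau)\|_*.
\end{aligned} \tag{34}
\]

It is easy to check that for $c \in [0, 1]$,

\[
\alpha(t) - \alpha(t + \tau) \le \frac{c\eta\tau}{(L + \eta t^c)^2 t^{1-c}} \le \frac{c\tau}{\eta t^{1+c}}.
\]

By convexity of $\|\cdot\|_*^2$, we can bound $\mathbb{E}[\|z(t) - z(t + \tau)\|_*^2]$:

\[
\mathbb{E}[\|z(t) - z(t + \tau)\|_*^2] = \tau^2 \, \mathbb{E}\bigg[\bigg\|\frac{1}{\tau} \sum_{s=1}^{\tau} \big(z(t + s) - z(t + s - 1)\big)\bigg\|_*^2\bigg] = \tau^2 \, \mathbb{E}\bigg[\bigg\|\frac{1}{\tau} \sum_{s=0}^{\tau - 1} g(t + s)\bigg\|_*^2\bigg] \le \tau^2 G^2,
\]

since $\mathbb{E}[\|\partial F(x; \xi)\|_*^2] \le G^2$ by assumption. Thus, bound (34) gives

\[
\begin{aligned}
\mathbb{E}[\|x(t) - x(t + \tau)\|^2] &\le 2(\alpha(t) - \alpha(t + \tau))^2 \, \mathbb{E}[\|z(t)\|_*^2] + 2\alpha(t + \tau)^2 \, \mathbb{E}[\|z(t) - z(t + \tau)\|_*^2] \\
&\le \frac{2c^2\tau^2}{\eta^2 t^{2+2c}} t^2 G^2 + 2G^2\tau^2\alpha(t + \tau)^2 = \frac{2c^2\tau^2 G^2}{\eta^2 t^{2c}} + \frac{2G^2\tau^2}{(L + \eta(t + \tau)^c)^2},
\end{aligned}
\]

where we use the Cauchy-Schwarz inequality in the first step. Since $c \le 1$, the last expression is clearly
bounded by $4G^2\tau^2/(\eta^2 t^{2c})$.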

To get the slightly tighter bound on the first moment in the statement of the lemma, simply
use the triangle inequality from the bound (34) and that $\sqrt{\mathbb{E} X^2} \ge \mathbb{E}|X|$.

The proof for the mirror-descent family of updates is similar. We focus on the non-delayed update (5),
as the other updates simply modify the indexing of $g(t + s)$ below. We know from
Lemma 6 and the triangle inequality that

\[
\|x(t) - x(t + \tau)\| \le \sum_{s=1}^{\tau} \|x(t + s) - x(t + s - 1)\| \le \sum_{s=1}^{\tau} \alpha(t + s - 1) \|g(t + s)\|_*.
\]

Squaring the above bound, taking expectations, and recalling that $\alpha(t)$ is non-increasing, we see

\[
\begin{aligned}
\mathbb{E}[\|x(t) - x(t + \tau)\|^2] &\le \sum_{s=1}^{\tau} \sum_{r=1}^{\tau} \alpha(t + s - 1) \alpha(t + r - 1) \, \mathbb{E}\big[\|g(t + s)\|_* \|g(t + r)\|_*\big] \\
&\le \tau^2 \alpha(t)^2 \max_{r, s} \sqrt{\mathbb{E}[\|g(t + s)\|_*^2]} \sqrt{\mathbb{E}[\|g(t + r)\|_*^2]} \le \tau^2 \alpha(t)^2 G^2
\end{aligned}
\]

by Hölder's inequality. Substituting the appropriate value for $\alpha(t)$ completes the proof. □



C Error in [LSZ09]

Langford et al. [LSZ09], in Lemma 1 of their paper, state an upper bound on $\langle g(t - \tau), x(t - \tau) - x^* \rangle$
that is essential to the proofs of all of their results. However, the lemma only holds as an equality
for unconstrained optimization (i.e. when the set $\mathcal{X} = \mathbb{R}^d$); in the presence of constraints, it fails
to hold (even as an upper bound). To see why, we consider a simple one-dimensional example with
$\mathcal{X} = [-1, 1]$, $f(x) = |x|$, $\eta = 2$ and we evaluate both sides of their lemma with $\tau = 1$ and $t = 2$.
The left hand side of their bound evaluates to 1, while the right hand side is $-1$, and the inequality
claimed in the lemma fails. The proofs of their main theorems rely on the application of their
Lemma 1 with equality, restricting those results only to unconstrained optimization. However, the
results also require boundedness of the gradients $g(t)$ over all of $\mathcal{X}$ as well as boundedness of the
distance between the iterates $x(t)$. Few convex functions have bounded gradients over all of $\mathbb{R}^d$,
and without constraints the iterates $x(t)$ are seldom bounded for all iterations $t$.